Conformer-based Ultrasound-to-Speech Conversion
| Authors: | |
|---|---|
| Document type: | Book part |
| Published: | International Speech Communication Association (ISCA), Dublin, 2025 |
| Series: | Interspeech; Annual Conference of the International Speech Communication Association, INTERSPEECH 2025 |
| Subjects: | |
| doi: | 10.21437/Interspeech.2025-2147 |
| mtmt: | 36397180 |
| Online Access: | http://publicatio.bibl.u-szeged.hu/39052 |
| Abstract: | Deep neural networks have shown promising potential for the ultrasound-to-speech conversion task towards Silent Speech Interfaces. In this work, we applied two Conformer-based DNN architectures (Base and one with bi-LSTM) to this task. Speaker-specific models were trained on the data of four speakers from the Ultrasuite-Tal80 dataset, while the generated mel spectrograms were synthesized into audio waveforms using a HiFi-GAN vocoder. Compared to a standard 2D-CNN baseline, objective measurements (MSE and mel cepstral distortion) showed no statistically significant improvement for either model. However, a MUSHRA listening test revealed that the Conformer with bi-LSTM provided better perceptual quality, while the Conformer Base matched the performance of the baseline and trained 3× faster owing to its simpler architecture. These findings suggest that Conformer-based models, especially the Conformer with bi-LSTM, offer a promising alternative to CNNs for ultrasound-to-speech conversion. © 2025 Elsevier B.V., All rights reserved. |
| Extent/Physical description: | 5 pages; pp. 5578-5582 |
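
Note on the objective measures named in the abstract: the record does not state how MSE and mel cepstral distortion (MCD) were computed, so the following is only a minimal sketch of one common definition, not the authors' evaluation code. It assumes the reference and generated sequences are already frame-aligned (no dynamic time warping) and that the 0th (energy) cepstral coefficient has been dropped; the function names `mse` and `mel_cepstral_distortion` are hypothetical.

```python
import math


def mse(ref, gen):
    """Mean squared error over all frames and bins of two equally shaped spectrograms."""
    total = 0.0
    count = 0
    for ref_frame, gen_frame in zip(ref, gen):
        for r, g in zip(ref_frame, gen_frame):
            total += (r - g) ** 2
            count += 1
    return total / count


def mel_cepstral_distortion(ref_mcep, gen_mcep):
    """Average MCD in dB over frames.

    Assumption: each frame is a list of mel-cepstral coefficients with the
    0th (energy) coefficient already excluded, and frames are time-aligned.
    """
    # Common MCD scaling factor: (10 / ln 10) * sqrt(2)
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    per_frame = [
        scale * math.sqrt(sum((r - g) ** 2 for r, g in zip(rf, gf)))
        for rf, gf in zip(ref_mcep, gen_mcep)
    ]
    return sum(per_frame) / len(per_frame)


if __name__ == "__main__":
    # Toy values for illustration only; these are not data from the paper.
    reference = [[1.0, 2.0], [0.5, 0.5]]
    generated = [[1.0, 4.0], [0.5, 0.5]]
    print("MSE:", mse(reference, generated))
    print("MCD (dB):", mel_cepstral_distortion(reference, generated))
```

Lower values indicate a closer match for both measures. Neither captures perceptual quality directly, which is consistent with the abstract reporting a MUSHRA listening test alongside them.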