Cross-lingual dysphonic speech detection using pretrained speaker embeddings

Bibliographic details
Authors:	Aziz Dosti Ali Hama Salih Sztahó Dávid
Corporate author:	Magyar számítógépes nyelvészeti konferencia (19.)
Document type:	Book chapter
Published:	2023
Series:	Magyar Számítógépes Nyelvészeti Konferencia 19
Keywords:	Linguistics - computer applications
Subjects:	Natural sciences; Computer and information sciences
Online Access:	http://acta.bibl.u-szeged.hu/78412

Descriptive data
Abstract:	This study addresses cross-lingual binary classification and severity estimation of dysphonic speech. Hand-crafted acoustic feature extraction is replaced by the speaker embedding techniques used in speaker verification. Two state-of-the-art deep learning methods for speaker verification are used: the X-vector and ECAPA-TDNN. Embeddings are extracted from Hungarian and Dutch speech samples and used to train a Support Vector Machine (SVM) for binary classification and a Support Vector Regressor (SVR) for severity estimation, in a cross-language manner. When the models were trained on Hungarian samples and evaluated on Dutch samples, our results were competitive with manual feature engineering in the binary classification of dysphonic speech and outperformed it in estimating the severity level of dysphonic speech. Moreover, our model achieved Spearman and Pearson correlations of 0.769 and 0.771, respectively. When the models were trained on Dutch samples and evaluated on Hungarian samples, with only a limited number of samples available for training, our results in both classification and regression were superior to those of the manual feature extraction technique. An accuracy of 86.8% was reached with features extracted by the embedding methods, while the maximum accuracy using hand-crafted acoustic features was 66.8%. Overall, the results show that Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network (ECAPA-TDNN) performs better than the earlier X-vector in both tasks.
Extent/Physical description:	171-183
ISBN:	978-963-306-912-7
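
The abstract evaluates severity estimation with Spearman and Pearson correlations between predicted and reference severity. The following is a minimal, standard-library sketch of how these two coefficients can be computed. The function names and the sample scores are hypothetical and chosen only for illustration. They are not taken from the paper, and the paper's own tooling and severity scale are not stated in this record.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical reference severity labels and regressor outputs.
reference_severity = [0, 1, 1, 2, 3, 3]
predicted_severity = [0.2, 0.9, 1.4, 1.8, 2.5, 3.1]

print("Pearson :", round(pearson(reference_severity, predicted_severity), 3))
print("Spearman:", round(spearman(reference_severity, predicted_severity), 3))
```

Pearson measures linear agreement between predicted and reference values, while Spearman only compares their ordering, so reporting both shows whether a regressor ranks speakers correctly even when its outputs are not linearly calibrated.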
