Smooth inverse frequency based text data selection for medical dictation



Bibliographic details
Authors: Bálint Domonkos
Mihajlik Péter
Corporate author: Magyar számítógépes nyelvészeti konferencia (17.) (2021) (Szeged)
Document type: Book part
Published: 2021
Series: Magyar Számítógépes Nyelvészeti Konferencia 17
Keywords: Linguistics - computer applications
Subjects:
Online Access: http://acta.bibl.u-szeged.hu/73371
Descriptive data
Abstract: The under-resourced domain problem is significant in automatic speech recognition, especially for small languages such as Hungarian and in fields where data is often confidential, such as finance and medicine. We introduce a method that uses word embeddings and a smooth inverse frequency (SIF) based distance measure to filter public-domain web corpora. The selection of documents matching the (medical) domain can be scaled. The resulting text is used to train an augmented language model for a medical dictation system. We show that an appropriately scaled selection gives the best ASR performance, improving on both baselines: one with no data augmentation and one in which all of the augmentation data was added.
Extent/Physical description: 233-242
ISBN: 978-963-306-781-9
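
The selection idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy 2-dimensional word vectors, the smoothing constant `a`, the use of cosine distance to an in-domain centroid, and the `keep_fraction` parameter are all assumptions made for illustration. Word probabilities are estimated from the pooled toy corpus, whereas in practice they would more likely come from a large general corpus. The common-component (first principal component) removal step of the original SIF formulation is omitted for brevity.

```python
from collections import Counter

import numpy as np


def sif_weights(docs, a=1e-3):
    """Map each word to a / (a + p(w)), where p(w) is its relative frequency in docs."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    return {w: a / (a + c / total) for w, c in counts.items()}


def sif_embed(docs, vectors, weights, dim):
    """Return one SIF-weighted average word vector per document (zeros if no known words)."""
    out = np.zeros((len(docs), dim))
    for i, doc in enumerate(docs):
        known = [w for w in doc if w in vectors and w in weights]
        if known:
            out[i] = np.mean([weights[w] * np.asarray(vectors[w]) for w in known], axis=0)
    return out


def select_documents(candidates, in_domain, vectors, keep_fraction=0.5, a=1e-3):
    """Return sorted indices of the candidates closest to the in-domain centroid.

    keep_fraction is a hypothetical knob that scales how much data is selected.
    """
    all_docs = in_domain + candidates
    weights = sif_weights(all_docs, a)
    dim = len(next(iter(vectors.values())))
    emb = sif_embed(all_docs, vectors, weights, dim)
    centroid = emb[: len(in_domain)].mean(axis=0)
    cand = emb[len(in_domain):]
    # Cosine distance to the centroid; the small constant avoids division by zero.
    norms = np.linalg.norm(cand, axis=1) * np.linalg.norm(centroid) + 1e-12
    dist = 1.0 - (cand @ centroid) / norms
    k = max(1, int(round(keep_fraction * len(candidates))))
    return sorted(np.argsort(dist)[:k].tolist())


# Toy data (made up): medical words point along the first axis, sports words along the second.
vectors = {
    "patient": np.array([1.0, 0.0]),
    "dose": np.array([0.9, 0.1]),
    "tumor": np.array([0.8, 0.2]),
    "goal": np.array([0.0, 1.0]),
    "match": np.array([0.1, 0.9]),
    "team": np.array([0.2, 0.8]),
}
in_domain = [["patient", "dose"], ["tumor", "patient"]]
candidates = [
    ["goal", "match", "team"],
    ["dose", "tumor"],
    ["team", "goal"],
    ["patient", "tumor", "dose"],
]

selected = select_documents(candidates, in_domain, vectors, keep_fraction=0.5)
print(selected)
```

Raising `keep_fraction` toward 1.0 corresponds to adding all augmentation data, while lowering it keeps only the closest matches; the abstract reports that an appropriately scaled selection between these extremes worked best.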