Pham Viet Thanh, Le Duc Cuong, Dao Dang Huy, Luu Duc Thanh, Nguyen Duc Tan, Dang Trung Duc Anh, Nguyen Thi Thu Trang


Abstract

Automatic speech recognition (ASR) has made huge advances with the arrival of end-to-end architectures. Semi-supervised learning methods, which can exploit unlabeled data, have contributed greatly to the success of ASR systems, in some cases giving them the ability to surpass human performance. However, most research focuses on developing these techniques for English speech recognition, which raises concerns about their performance in other languages, especially in low-resource scenarios. In this paper, we propose a Vietnamese ASR system for the VLSP 2021 Automatic Speech Recognition Shared Task. The system is based on the Wav2vec 2.0 framework, combined with self-training and several data augmentation techniques. Experimental results show that on the ASR-T1 test set of the shared task, our proposed model achieved a remarkable result, ranking second with a Syllable Error Rate (SyER) of 11.08%.
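
To illustrate the self-training component mentioned above, the sketch below shows one pseudo-labeling pass with a Wav2vec 2.0 CTC model via the Hugging Face `transformers` library. This is a minimal, generic sketch, not the authors' implementation: the checkpoint name and the `pseudo_label` helper are illustrative placeholders, and the paper's actual model, decoding setup, and filtering of pseudo-labels are not specified here.

```python
# Minimal sketch of Wav2vec 2.0 pseudo-labeling for self-training.
# The checkpoint name is a placeholder, not the model used in the paper.
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL_NAME = "facebook/wav2vec2-base-960h"  # placeholder checkpoint
processor = Wav2Vec2Processor.from_pretrained(MODEL_NAME)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_NAME).eval()

def pseudo_label(wav_path: str) -> str:
    """Transcribe one unlabeled utterance with the current model."""
    waveform, sr = torchaudio.load(wav_path)
    # Wav2vec 2.0 checkpoints expect 16 kHz mono audio.
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
    inputs = processor(waveform.squeeze(0), sampling_rate=16000,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    pred_ids = torch.argmax(logits, dim=-1)  # greedy CTC decoding
    return processor.batch_decode(pred_ids)[0]

# In a self-training loop, the resulting (audio, pseudo-transcript)
# pairs would be mixed with the labeled set to fine-tune the next
# iteration of the model.
```

In practice, pseudo-labels are usually filtered by a confidence criterion before being added to the training set, so that errors from the current model do not accumulate across iterations.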