Nguyen Le Minh, An Quoc Do, Viet Quoc Vu, Huyen Thuc Khanh Vo

Abstract

The Association for Vietnamese Language and Speech Processing (VLSP) has organized a series of workshops that aim to bring together researchers and professionals working in NLP and to synthesize research on the Vietnamese language. One of the shared tasks held at the eighth workshop is TTS [14], using a dataset that consists only of spontaneous audio. This poses a challenge for current TTS models, since they perform well only on reading-style speech (e.g., audiobooks). Moreover, the quality of the audio in the dataset has a large impact on model performance: samples with noisy backgrounds or with multiple voices speaking at the same time degrade the performance of our model. In this paper, we describe our approach to this problem: we first preprocess the training data, then use it to train a FastSpeech2 [10] acoustic model with some replacements in the external aligner; finally, we use a HiFiGAN [4] vocoder to construct the waveform. In the official evaluation of the VLSP 2021 TTS task, our approach achieves a 3.729 in-domain MOS, a 3.557 out-of-domain MOS, and a 79.70% SUS score. Audio samples are available at https://navi-tts.github.io/.
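For readers unfamiliar with the two-stage pipeline named in the abstract, the sketch below illustrates the data flow: text is converted to a phoneme sequence, the acoustic model maps phonemes to a mel-spectrogram, and the vocoder maps the mel-spectrogram to a waveform. All functions here (`text_to_phonemes`, `acoustic_model`, `vocoder`) are hypothetical stand-ins for illustration only, not the authors' actual FastSpeech2 or HiFiGAN code.

```python
# Minimal sketch of a two-stage TTS pipeline (acoustic model + vocoder).
# All components below are placeholder stubs, not the VLSP 2021 system.
import numpy as np

def text_to_phonemes(text: str) -> list[str]:
    # Placeholder grapheme-to-phoneme step; a real system would use a
    # Vietnamese G2P front end rather than splitting into characters.
    return list(text.lower())

def acoustic_model(phonemes: list[str]) -> np.ndarray:
    # Stand-in for the FastSpeech2 acoustic model: maps a phoneme
    # sequence to a mel-spectrogram of shape (frames, n_mels).
    # Random values here; assumes roughly 10 frames per phoneme and
    # 80 mel bins, a common configuration.
    frames = 10 * len(phonemes)
    return np.random.randn(frames, 80)

def vocoder(mel: np.ndarray, hop_length: int = 256) -> np.ndarray:
    # Stand-in for the HiFiGAN vocoder: each mel frame corresponds to
    # `hop_length` waveform samples. Random values here.
    return np.random.randn(mel.shape[0] * hop_length)

if __name__ == "__main__":
    mel = acoustic_model(text_to_phonemes("xin chào"))
    wav = vocoder(mel)
    print(mel.shape, wav.shape)  # e.g. (80, 80) and (20480,)
```

The key design point this sketch mirrors is the separation of concerns: the acoustic model owns prosody and spectral content, while the vocoder owns waveform generation, so either stage can be swapped or retrained independently.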