VLSP 2021 - TTS Challenge: Vietnamese Spontaneous Speech Synthesis

Nguyen Thi Thu Trang; Hoang Ky Nguyen

doi:10.25073/2588-1086/vnucsce.358

Spontaneous Speech Synthesis for Vietnamese



Published Jun 30, 2022


How to Cite

TRANG, Nguyen Thi Thu; NGUYEN, Hoang Ky. VLSP 2021 - TTS Challenge: Vietnamese Spontaneous Speech Synthesis. VNU Journal of Science: Computer Science and Communication Engineering, [S.l.], v. 38, n. 1, june 2022. ISSN 2588-1086. Available at: <//jcsce.vnu.edu.vn/index.php/jcsce/article/view/358>. Date accessed: 26 aug. 2025. doi: https://doi.org/10.25073/2588-1086/vnucsce.358.


Issue

Vol 38 No 1: Special Issue: The 8th International Workshop on Vietnamese Language and Speech Processing (VLSP 2021)

Section

Special Issue on Vietnamese Language and Speech Processing (VLSP2021)

Abstract

Text-To-Speech (TTS) was one of nine shared tasks in the eighth annual international VLSP workshop, VLSP 2021. All three previous TTS shared tasks were conducted on read-speech datasets. However, the resulting synthetic voices were not natural enough for spoken dialog systems, in which the computer must converse with a human. Speech datasets recorded in a spontaneous setting help a TTS system produce voices that are more natural in speaking style, speaking rate, intonation, and so on. Therefore, in this shared task, participants were asked to build a TTS system from a spontaneous speech dataset. This 7.5-hour dataset was collected from the channel of a well-known YouTuber, "Giang ơi...", and then pre-processed to build utterances and their corresponding transcripts. The main challenges of this year's task were: (i) inconsistency in speaking rate, intensity, stress, and prosody across the dataset, (ii) background noise or overlapping voices from other speakers, and (iii) inaccurate transcripts. A total of 43 teams registered for this shared task, and in the end 8 submissions were evaluated online with perceptual tests. Two types of perceptual tests were conducted: (i) a MOS test for naturalness and (ii) a SUS (Semantically Unpredictable Sentences) test for intelligibility. The most intelligible TTS system in the SUS test had a syllable error rate of 15%, while the best MOS score on dialog utterances was 3.98 versus 4.54 points on a 5-point MOS scale. The prosody and speaking rate of the synthetic voices were similar to those of the natural voice. However, most of the TTS systems still produced some distorted segments and background noise, and half of them had a syllable error rate of at least 30%.
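
The two figures reported above, syllable error rate and MOS, can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas used by the organizers, so this assumes the conventional definitions: syllable error rate as the Levenshtein edit distance between the reference and transcribed syllable sequences divided by the reference length, and MOS as the arithmetic mean of listener ratings on a 1 to 5 scale. The function names `syllable_error_rate` and `mean_opinion_score` are hypothetical, and splitting on whitespace relies on Vietnamese orthography separating syllables with spaces.

```python
from typing import Sequence


def syllable_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance over syllables divided by the reference length.

    Assumption: syllables are whitespace-separated, as in written Vietnamese.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one syllable")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j syllables of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_syl in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_syl in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_syl == hyp_syl else 1)
            deletion = prev[j] + 1       # reference syllable missing from hypothesis
            insertion = curr[j - 1] + 1  # extra syllable in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def mean_opinion_score(ratings: Sequence[int]) -> float:
    """Arithmetic mean of listener ratings on an assumed 1-5 scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return sum(ratings) / len(ratings)


if __name__ == "__main__":
    # One of four syllables is transcribed wrongly (missing tone mark).
    print(syllable_error_rate("xin chào các bạn", "xin chao các bạn"))  # 0.25
    print(mean_opinion_score([4, 5, 3, 4]))  # 4.0
```

Under these assumptions, a syllable error rate of 15% means roughly 15 syllable-level edits for every 100 reference syllables that listeners transcribed.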
