Nguyen Thi Thu Trang, Huu Tuong Tu, Le Hoang Anh Tuan

Abstract

VLSP 2025 marks the eleventh annual workshop organized by the Vietnamese Language and Speech Processing community. This year, we introduce the inaugural Vietnamese Voice Conversion (VC) shared task, establishing a standardized benchmark for evaluating speech technologies in Vietnamese. The task focuses on developing systems that convert a source speaker's voice to a target speaker's identity while preserving linguistic content and naturalness. To support this initiative, we released a large-scale, multi-genre dataset comprising over 26 hours of speech from 100 speakers across diverse recording conditions. The challenge attracted 18 participating teams. The top-performing system, based on a multilingual diffusion-transformer architecture, achieved a MOS of 4.29, an SMOS_TGT of 3.65, and a WER of 9.83. These results provide reference benchmarks and a foundation for future research in Vietnamese voice conversion.


Keywords: Voice conversion, Text to Speech, Multi-genre dataset, Benchmark evaluation.