A Two-Stage Vietnamese Spelling Correction Pipeline Combining Underthesea and BARTpho

Hao T. N. Huynh; Long S. T. Nguyen; Nam H. Nguyen; Hoang M. Nguyen; Tho T. Quan

doi:10.25073/2588-1086/vnucsce.7020

Published May 19, 2026

How to Cite

HUYNH, Hao T. N. et al. A Two-Stage Vietnamese Spelling Correction Pipeline Combining Underthesea and BARTpho. VNU Journal of Science: Computer Science and Communication Engineering, [S.l.], May 2026. ISSN 2588-1086. Available at: <//jcsce.vnu.edu.vn/index.php/jcsce/article/view/7020>. Date accessed: 02 July 2026. doi: https://doi.org/10.25073/2588-1086/vnucsce.7020.

Issue

Article in Press

Section

Original Articles

Abstract

Vietnamese spelling correction is challenging due to the language’s rich diacritic system, syllable-based tokenization, and the frequent presence of strict entities in administrative and legal texts. While sequence-to-sequence models achieve strong correction accuracy, they are prone to over-correction and unintended rewriting under domain shift, which limits their reliability in high-stakes applications. In this paper, we propose a deployment-oriented two-stage Vietnamese spelling correction pipeline. The first stage performs text normalization and conservative error detection using Underthesea, combined with entity masking to preserve rigid identifiers and formatting. The second stage applies context-aware correction with a BARTpho sequence-to-sequence model, followed by detector-guided post-processing and iterative masked refinement to control unnecessary edits. To support realistic evaluation, we construct a hybrid dataset that mixes synthetic spelling noise with real-world errors collected from administrative documents. Experiments against strong multilingual and Vietnamese-specific baselines show that the proposed pipeline achieves high correction accuracy while significantly reducing over-correction. Beyond standard end-to-end metrics, we introduce detection-oriented analyses that explicitly quantify correction behavior at flagged positions, providing clearer evidence of practical safety for real-world deployment.

Keywords: Vietnamese spelling correction, controlled text editing, sequence-to-sequence models, Underthesea, BARTpho, administrative text.
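
To make the idea of entity masking and detector-guided post-processing concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. The regular expressions, the `<ENTn>` placeholder format, and the `detect` and `correct` callables are all hypothetical: `detect` stands in for the Underthesea-based detector and `correct` for the BARTpho model, neither of which is called here. The sketch also assumes that correction keeps the number of whitespace-separated syllables unchanged, and falls back to the input when it does not.

```python
import re
from typing import Callable, List, Set, Tuple

# Hypothetical patterns for "rigid" entities; the actual entity inventory
# used in the paper is not given in the abstract.
ENTITY_PATTERN = re.compile(
    r"\d{1,2}/\d{1,2}/\d{4}"              # dates such as 19/05/2026
    r"|\d+/\d{4}/[A-ZĐ]+(?:-[A-ZĐ]+)*"    # document numbers such as 15/2020/NĐ-CP
)


def mask_entities(text: str) -> Tuple[str, List[str]]:
    """Replace each matched entity with a placeholder <ENT0>, <ENT1>, ..."""
    entities: List[str] = []

    def _sub(match: "re.Match[str]") -> str:
        entities.append(match.group(0))
        return f"<ENT{len(entities) - 1}>"

    return ENTITY_PATTERN.sub(_sub, text), entities


def unmask_entities(text: str, entities: List[str]) -> str:
    """Put the original entity strings back in place of the placeholders."""
    for i, entity in enumerate(entities):
        text = text.replace(f"<ENT{i}>", entity)
    return text


def guided_correct(
    text: str,
    detect: Callable[[List[str]], Set[int]],  # stand-in for the detector
    correct: Callable[[str], str],            # stand-in for the seq2seq model
) -> str:
    """Accept the model's edits only at positions the detector flagged."""
    masked, entities = mask_entities(text)
    tokens = masked.split()
    flagged = detect(tokens)
    proposal = correct(masked).split()

    if len(proposal) != len(tokens):
        # Conservative fallback: token alignment is lost, so keep the input.
        merged = tokens
    else:
        merged = [
            new if i in flagged and not old.startswith("<ENT") else old
            for i, (old, new) in enumerate(zip(tokens, proposal))
        ]
    # Note: joining with single spaces normalizes runs of whitespace.
    return unmask_entities(" ".join(merged), entities)


if __name__ == "__main__":
    # Toy stand-ins: the detector flags one known misspelling, while the
    # "model" fixes it but also rewrites a correct word (over-correction).
    def toy_detect(tokens: List[str]) -> Set[int]:
        return {i for i, tok in enumerate(tokens) if tok == "hanh"}

    def toy_correct(sentence: str) -> str:
        return sentence.replace("hanh", "hành").replace("ban", "bán")

    sample = "Nghị định số 15/2020/NĐ-CP ban hanh ngày 19/05/2026"
    print(guided_correct(sample, toy_detect, toy_correct))
```

In this toy run the over-corrected word is discarded because its position was never flagged, and the document number and date pass through untouched because the model only ever sees their placeholders.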
