Hao T. N. Huynh, Long S. T. Nguyen, Nam H. Nguyen, Hoang M. Nguyen, Tho T. Quan

Abstract

Vietnamese spelling correction is challenging due to the language’s rich diacritic system, syllable-based tokenization, and the frequent presence of strict entities in administrative and
legal texts. While sequence-to-sequence models achieve strong correction accuracy, they are prone
to over-correction and unintended rewriting under domain shift, which limits their reliability in high-stakes applications. In this paper, we propose a deployment-oriented two-stage Vietnamese spelling
correction pipeline. The first stage performs text normalization and conservative error detection using
Underthesea, combined with entity masking to preserve rigid identifiers and formatting. The second
stage applies context-aware correction with a BARTpho sequence-to-sequence model, followed by
detector-guided post-processing and iterative masked refinement to control unnecessary edits. To
support realistic evaluation, we construct a hybrid dataset that mixes synthetic spelling noise with
real-world errors collected from administrative documents. Experiments against strong multilingual
and Vietnamese-specific baselines show that the proposed pipeline achieves high correction accuracy while significantly reducing over-correction. Beyond standard end-to-end metrics, we introduce
detection-oriented analyses that explicitly quantify correction behavior at flagged positions, providing clearer evidence of practical safety for real-world deployment.
Keywords: Vietnamese spelling correction, controlled text editing, sequence-to-sequence models,
Underthesea, BARTpho, administrative text
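
As an illustration of the entity masking and detector-guided post-processing described in the abstract, the following is a minimal Python sketch, not the authors' implementation. The regular expressions, the `<ENTi>` placeholder format, and the function names `mask_entities`, `unmask_entities`, and `guided_merge` are all assumptions made for this example. The detector and the sequence-to-sequence corrector are left out; the sketch only shows how rigid identifiers might be shielded from the corrector and how edits could be restricted to flagged positions.

```python
import re

# Hypothetical patterns for rigid entities; the paper's actual rules are not given here.
ENTITY_PATTERN = re.compile(
    r"\d+/\d{4}/[A-ZĐ]+(?:-[A-ZĐ]+)*"  # document numbers, e.g. 123/2024/QĐ-UBND
    r"|\d{1,2}/\d{1,2}/\d{4}"          # dates, e.g. 05/03/2024
)


def mask_entities(text):
    """Replace rigid entities with placeholders; return (masked_text, entities)."""
    entities = []

    def _swap(match):
        entities.append(match.group(0))
        return f"<ENT{len(entities) - 1}>"

    return ENTITY_PATTERN.sub(_swap, text), entities


def unmask_entities(text, entities):
    """Restore the original entities in place of their placeholders."""
    for i, entity in enumerate(entities):
        text = text.replace(f"<ENT{i}>", entity)
    return text


def guided_merge(source_tokens, corrected_tokens, flagged):
    """Accept the corrector's token only at positions the detector flagged."""
    if len(source_tokens) != len(corrected_tokens):
        # Conservative fallback (an assumption): reject rewrites that change length.
        return list(source_tokens)
    return [
        corrected if i in flagged else original
        for i, (original, corrected) in enumerate(zip(source_tokens, corrected_tokens))
    ]
```

In this sketch, a made-up input such as `"Quyết định 123/2024/QĐ-UBND ngày 05/03/2024"` would be masked before correction and restored afterwards, so the corrector never sees the identifier or the date. Restricting accepted edits to flagged positions is one simple way to limit over-correction; the paper's iterative masked refinement is not reproduced here.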