Adaptation in Statistical Machine Translation for Low-resource Domains in English-Vietnamese Language

Nghia-Luan Pham; Van-Vinh Nguyen

doi:10.25073/2588-1086/vnucsce.231

Published May 30, 2020

How to Cite

PHAM, Nghia-Luan; NGUYEN, Van-Vinh. Adaptation in Statistical Machine Translation for Low-resource Domains in English-Vietnamese Language. VNU Journal of Science: Computer Science and Communication Engineering, [S.l.], v. 36, n. 1, may 2020. ISSN 2588-1086. Available at: <//jcsce.vnu.edu.vn/index.php/jcsce/article/view/231>. Date accessed: 12 july 2026. doi: https://doi.org/10.25073/2588-1086/vnucsce.231.

Issue

Vol 36 No 1 (2020)

Section

Original Articles

Abstract

In this paper, we propose a new method for domain adaptation in Statistical Machine Translation (SMT) for low-resource domains in the English-Vietnamese language pair. Specifically, our method uses only monolingual data to adapt the translation phrase table, and the resulting system improves over the SMT baseline system. We propose two steps to improve the quality of the SMT system: (i) classify the phrases on the target side of the translation phrase table using a probabilistic classifier model, and (ii) adapt the phrase table by recomputing the direct translation probability of its phrases.
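The two steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the recomputation formula, so the sketch assumes the simplest plausible scheme, in which each direct translation probability p(target | source) is multiplied by a classifier's in-domain probability for the target phrase and then renormalized per source phrase. The function name `adapt_phrase_table`, the `weight` parameter, the stand-in classifier `toy_in_domain_prob`, and the toy phrase-table entries are all hypothetical.

```python
from collections import defaultdict


def adapt_phrase_table(entries, in_domain_prob, weight=1.0):
    """Rescale direct translation probabilities with a domain classifier.

    entries: list of (source, target, direct_prob) tuples.
    in_domain_prob: callable mapping a target phrase to the probability
        (in [0, 1]) that it belongs to the domain being adapted to.
        This stands in for the probabilistic classifier of step (i).
    weight: exponent controlling how strongly the classifier is trusted
        (an assumption of this sketch, not taken from the paper).
    """
    # Step (i): score every target phrase with the classifier and
    # combine that score with the original direct probability.
    rescored = defaultdict(list)
    for source, target, prob in entries:
        score = prob * (in_domain_prob(target) ** weight)
        rescored[source].append((target, score))

    # Step (ii): renormalize so that, for each source phrase, the
    # recomputed direct probabilities sum to one.
    adapted = []
    for source, candidates in rescored.items():
        total = sum(score for _, score in candidates)
        for target, score in candidates:
            # Fall back to a uniform distribution if every score is zero.
            new_prob = score / total if total > 0 else 1.0 / len(candidates)
            adapted.append((source, target, new_prob))
    return adapted


# Toy data, invented for illustration only: two Vietnamese candidates for
# "court", one legal ("tòa án") and one general ("sân").
toy_entries = [("court", "tòa án", 0.4), ("court", "sân", 0.6)]
toy_scores = {"tòa án": 0.9, "sân": 0.1}


def toy_in_domain_prob(target):
    # Hypothetical classifier output; a real one would be trained on
    # monolingual target-language text from each domain.
    return toy_scores[target]


adapted = adapt_phrase_table(toy_entries, toy_in_domain_prob)
for source, target, prob in adapted:
    print(source, target, round(prob, 3))
```

In this toy case the legal-domain candidate ends up with the larger share of the probability mass even though it started with the smaller one, which is the kind of shift that phrase-table adaptation toward the legal domain is meant to produce.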

Our experiments translate from English to Vietnamese across two very different domains: the legal domain (out-of-domain) and the general domain (in-domain). The English-Vietnamese parallel corpus is provided by the IWSLT 2015 organizers. The experimental results show that our method significantly outperforms the baseline system, improving machine translation quality in the legal domain by up to 0.9 BLEU points over the baseline system,…

Keywords:

Machine Translation, Statistical Machine Translation, Domain Adaptation
