ViMRC - VLSP 2021: Using XLM-RoBERTa and Filter Output for Vietnamese Machine Reading Comprehension
Abstract
Machine Reading Comprehension (MRC) has recently made significant progress. This paper presents the results of our participation in building an MRC system for Vietnamese in the Vietnamese Machine Reading Comprehension shared task at the 8th International Workshop on Vietnamese Language and Speech Processing (VLSP 2021). Based on SQuAD2.0, the organizing committee developed the Vietnamese Question Answering Dataset UIT-ViQuAD2.0, a reading comprehension dataset consisting of questions posed by crowd-workers on a set of Vietnamese Wikipedia articles. UIT-ViQuAD2.0 evolved from version 1.0, the key difference being that version 2.0 contains both answerable and unanswerable questions. The main challenge of this task is distinguishing answerable from unanswerable questions: the answer to each question is a span of text from the corresponding reading passage, or the question may be unanswerable. Our system employs simple yet highly effective methods. It uses the pre-trained language model XLM-RoBERTa (XLM-R), combined with filtering the results from multiple output files to produce the final answer. We created about 5-7 output files and selected the answer with the most repetitions as the final predicted answer. After filtering, our system improved from 75.172% to 76.386% on the F1 measure and achieved 65.329% on the EM measure on the Private Test set.
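As we read it, the filtering step described above amounts to a majority vote over the answers produced by several model runs. A minimal sketch of that idea, assuming each output file is a mapping from question id to answer string (function names and example data are ours, not from the paper):

```python
from collections import Counter

def majority_vote(prediction_sets):
    """For each question id, pick the answer that appears most often
    across several prediction dicts (question id -> answer string).
    An empty string can stand for an 'unanswerable' prediction."""
    merged = {}
    for qid in prediction_sets[0]:
        answers = [preds[qid] for preds in prediction_sets if qid in preds]
        # most_common(1) returns the most frequent answer;
        # ties are broken by first occurrence (Python 3.7+).
        merged[qid] = Counter(answers).most_common(1)[0][0]
    return merged

# Example: three runs disagree on q2; the repeated answer wins.
runs = [
    {"q1": "Hà Nội", "q2": "1945"},
    {"q1": "Hà Nội", "q2": ""},
    {"q1": "Hà Nội", "q2": "1945"},
]
print(majority_vote(runs))  # {'q1': 'Hà Nội', 'q2': '1945'}
```

With 5-7 runs, an odd number of votes avoids many ties; how the actual system breaks ties is not specified in the abstract.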