VLSP 2025 Challenge: Numerical Reasoning Question Answering
Abstract
The VLSP 2025 Shared Task on Numerical Reasoning Question Answering (NumQA)
is the first initiative to address numerical reasoning in Vietnamese financial texts. To support this
effort, we constructed ViNumQA, a large-scale benchmark dataset comprising over 4,000 manually validated question-program-answer triples. The dataset integrates two complementary sources:
a human-verified Vietnamese translation of FinQA and newly constructed QA pairs derived from
domestic corporate financial reports. Each instance requires systems to generate a transparent mathematical reasoning program and produce a final numerical answer, enabling explicit evaluation of
both reasoning correctness and result accuracy. We established robust baselines using the LLaMA
model family and compared them against state-of-the-art proprietary LLMs (GPT-4o, GPT-5 mini).
The results demonstrate that supervised fine-tuning is essential for adherence to reasoning schemas,
as few-shot prompting strategies suffered from high invalid generation rates. The shared task included two subtasks: (1) a constrained track focusing on efficient, reproducible modeling without
external APIs, and (2) an unconstrained track allowing LLM-assisted training. The best-performing
constrained model achieved the highest scores in both Program and Execution Accuracy. Meanwhile, an
inference-only agent attained a highly competitive Execution Accuracy without any fine-tuning. By
releasing ViNumQA and evaluating multiple methods, this work provides a key resource for Vietnamese financial NLP and reveals the balance between interpretability and accuracy in numerical
reasoning systems.
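To make the question-program-answer setup concrete, the following is a minimal, hypothetical sketch of how Execution Accuracy can be scored by executing a generated reasoning program. The operation names (`add`, `subtract`, `multiply`, `divide`) and the `#i` back-reference syntax follow the original FinQA convention; the numeric values and the `execute` helper are invented for illustration and are not taken from ViNumQA.

```python
# Hypothetical executor for FinQA-style reasoning programs.
# A program is a comma-separated list of steps, e.g.
#   "subtract(5829, 5735), divide(#0, 5735)"
# where "#i" refers to the result of step i.

OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}

def execute(program: str) -> float:
    """Run each step in order, resolving '#i' back-references."""
    results = []
    for step in program.split("), "):
        op, args = step.rstrip(")").split("(")
        vals = [
            results[int(a[1:])] if a.startswith("#") else float(a)
            for a in (x.strip() for x in args.split(","))
        ]
        results.append(OPS[op](*vals))
    return results[-1]  # the final step's result is the answer

# Execution Accuracy compares this executed value against the gold answer,
# while Program Accuracy compares the program string/structure itself.
answer = execute("subtract(5829, 5735), divide(#0, 5735)")
```

Under this framing, a system can produce the correct final number via a different (even spurious) program, which is why the task evaluates Program Accuracy and Execution Accuracy separately.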
Keywords: Numerical Reasoning, Question Answering, ViNumQA, VLSP 2025, Vietnamese