A Graph Attention Network-Based Vietnamese Multimodal Sentiment Analysis Model with Cross-Modal Fusion
Hoang Nam Do, Dinh Tai Pham

Abstract

Multimodal sentiment analysis (MSA) models often rely on simple fusion methods that overlook the diverse relationships within each modality. Vietnamese, with its rich and complex grammatical and lexical features, poses additional challenges. This paper proposes ViMACSA-GAT, a new model that combines graph attention networks in two parallel branches with a cross-modal fusion transformer. The model proceeds in the following main steps. The text branch encodes Vietnamese text with PhoBERT, and a graph attention network (GAT) models contextual dependencies between tokens. The image branch extracts features with a vision transformer (ViT), building a graph whose nodes represent both the global image and specific regions of interest (ROIs); a GAT then captures the relationships among these visual elements. Finally, a cross-modal transformer performs deep fusion by jointly processing node-level representations from both graphs, enabling fine-grained intermodal alignment. The model is trained with focal loss to handle class imbalance and is evaluated on the Vietnamese ViMACSA dataset, achieving strong performance (Accuracy = 85.50%, Precision = 74.01%, Recall = 72.04%, F1-score = 72.63%). These results demonstrate the effectiveness of the proposed model for Vietnamese MSA.
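Both branches of the model rely on graph attention. As a minimal illustrative sketch (not the paper's implementation), a single-head GAT layer in the standard Velickovic et al. formulation can be written as follows; the function names and shapes here are assumptions for illustration:

```python
import numpy as np

def gat_layer(H, A, W, a, slope=0.2):
    """One graph-attention head (illustrative sketch, not the paper's code).

    H: (N, F) node features; A: (N, N) adjacency (nonzero = edge);
    W: (F, F') projection; a: (2*F',) attention vector; slope: LeakyReLU slope.
    """
    Z = H @ W                                   # project node features
    N = Z.shape[0]
    # attention logits e_ij = LeakyReLU(a^T [z_i || z_j]) for every node pair
    e = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            e[i, j] = a @ np.concatenate([Z[i], Z[j]])
    e = np.where(e > 0, e, slope * e)           # LeakyReLU
    e = np.where(A > 0, e, -1e9)                # mask pairs with no edge
    att = np.exp(e - e.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)       # softmax over neighbours
    return att @ Z                              # attention-weighted aggregation
```

In the text branch the nodes would be PhoBERT token representations; in the image branch, the ViT global feature plus ROI features.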
Keywords: Multimodal sentiment analysis, graph attention networks, multimodal fusion, vision
transformers, Vietnamese language processing, region of interest.
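The abstract states that class imbalance is handled with focal loss. As a minimal sketch of the standard binary focal loss of Lin et al. (the specific alpha/gamma values used in the paper are not given here and are assumptions):

```python
import numpy as np

def focal_loss(probs, labels, gamma=2.0, alpha=0.25):
    """Binary focal loss: FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t).

    probs: predicted probabilities for the positive class; labels: 0/1.
    The (1 - p_t)^gamma factor down-weights well-classified examples,
    focusing the gradient on hard, minority-class samples.
    """
    p_t = np.where(labels == 1, probs, 1 - probs)
    return float(np.mean(-alpha * (1 - p_t) ** gamma * np.log(p_t)))
```

A well-classified example (p_t close to 1) contributes almost nothing, so the loss concentrates on the misclassified minority classes that drive the imbalance.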