A Graph Attention Network-Based Vietnamese Multimodal Sentiment Analysis Model with Cross-Modal Fusion
Hoang Nam Do, Dinh Tai Pham

Abstract

Multimodal sentiment analysis (MSA) models often rely on simple fusion methods that overlook the diverse relationships within each modality. Vietnamese, with its rich and complex grammatical and lexical features, poses additional challenges. This paper proposes ViMACSA-GAT, a new model that combines graph attention networks in two parallel branches with a cross-modal fusion transformer. The model proceeds in the following main steps. The text branch encodes Vietnamese text with PhoBERT, and a graph attention network (GAT) models contextual dependencies between tokens. The image branch extracts features with a vision transformer (ViT), building a graph whose nodes represent both the global image and specific regions of interest (ROIs); a GAT then captures the relationships among these visual elements. Finally, a cross-modal transformer performs deep fusion by jointly processing node-level representations from both graphs, enabling fine-grained intermodal alignment. The model is trained with focal loss to handle class imbalance and is evaluated on the Vietnamese ViMACSA dataset, achieving strong performance (Accuracy = 85.50%, Precision = 74.01%, Recall = 72.04%, F1-score = 72.63%). These results demonstrate the effectiveness of the proposed model for Vietnamese MSA.
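Both branches of the model rely on graph attention. As a minimal illustrative sketch (not the paper's implementation), a single-head GAT layer in the standard Velickovic et al. formulation can be written as follows; the function names and shapes here are assumptions for illustration:

```python
import numpy as np

def gat_layer(H, A, W, a, slope=0.2):
    """One graph-attention head (illustrative sketch, not the paper's code).

    H: (N, F) node features; A: (N, N) adjacency (nonzero = edge);
    W: (F, F') projection; a: (2*F',) attention vector; slope: LeakyReLU slope.
    """
    Z = H @ W                                   # project node features
    N = Z.shape[0]
    # attention logits e_ij = LeakyReLU(a^T [z_i || z_j]) for every node pair
    e = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            e[i, j] = a @ np.concatenate([Z[i], Z[j]])
    e = np.where(e > 0, e, slope * e)           # LeakyReLU
    e = np.where(A > 0, e, -1e9)                # mask pairs with no edge
    att = np.exp(e - e.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)       # softmax over neighbours
    return att @ Z                              # attention-weighted aggregation
```

In the text branch the nodes would be PhoBERT token representations; in the image branch, the ViT global feature plus ROI features.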
Keywords: Multimodal sentiment analysis, graph attention networks, multimodal fusion, vision
transformers, Vietnamese language processing, region of interest.
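The abstract states that class imbalance is handled with focal loss. As a minimal sketch of the standard binary focal loss of Lin et al. (the specific alpha/gamma values used in the paper are not given here and are assumptions):

```python
import numpy as np

def focal_loss(probs, labels, gamma=2.0, alpha=0.25):
    """Binary focal loss: FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t).

    probs: predicted probabilities for the positive class; labels: 0/1.
    The (1 - p_t)^gamma factor down-weights well-classified examples,
    focusing the gradient on hard, minority-class samples.
    """
    p_t = np.where(labels == 1, probs, 1 - probs)
    return float(np.mean(-alpha * (1 - p_t) ** gamma * np.log(p_t)))
```

A well-classified example (p_t close to 1) contributes almost nothing, so the loss concentrates on the misclassified minority classes that drive the imbalance.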