Truong Phi Ho, Hoang Thanh Nam, Pham Minh Thuan, Nguyen The Hung, Nguyen Nhat Hai, Bui Thu Lam
Abstract

Large Language Models (LLMs) employ advanced safety alignment mechanisms designed to prevent the generation of harmful or malicious content. Despite these protections, LLMs remain susceptible to a range of security vulnerabilities that adversaries can exploit through various attack techniques. In this work, we introduce a novel evaluation method based on an automated black-box adversarial attack approach, designed to assess and probe LLMs without requiring access to their internal architecture. Our method leverages a characteristic failure mode commonly observed during GAN training, known as mode collapse. Specifically, we train an enhanced Sequence Generative Adversarial Network (SeqGAN) on the proposed dataset of 808,700 questions until the generator enters a collapsed state, producing highly degenerate sequences. We refer to the resulting model as J-SeqGAN (Junk generated by Sequences-GAN), highlighting its generation of "junk" sequences that form the basis for subsequent adversarial attacks. We show that these sequences function as adversarial noise, implicitly optimized through the adversarial dynamics of GAN training. They can be employed to evaluate the robustness of Large Language Models, specifically their ability to resist exploitation and prevent the generation of harmful or malicious content.
Keywords: Large Language Models, Generative Adversarial Networks, adversarial noise, black-box testing, jailbreak.