Truong Phi Ho, Hoang Thanh Nam, Pham Minh Thuan, Nguyen The Hung, Nguyen Nhat Hai, Bui Thu Lam
Abstract

Large Language Models (LLMs) employ advanced safety alignment mechanisms designed to prevent the generation of harmful or malicious content. Despite these protections, LLMs remain susceptible to a range of security vulnerabilities that adversaries can exploit through various attack techniques. In this work, we introduce a novel evaluation method based on an automated black-box adversarial attack approach, designed to assess and probe LLMs without requiring access to their internal architecture. Our method leverages a characteristic failure mode commonly observed during GAN training, known as mode collapse. Specifically, we train an enhanced Sequence Generative Adversarial Network (SeqGAN) on the proposed dataset of 808,700 questions until the generator enters a collapsed state, producing highly degenerate sequences. We refer to the resulting model as J-SeqGAN (Junk generated by Sequences-GAN), highlighting its generation of "junk" sequences that form the basis for subsequent adversarial attacks. We show that these sequences function as adversarial noise, implicitly optimized through the adversarial dynamics of GAN training. They can be employed to evaluate the robustness of Large Language Models, specifically their ability to resist exploitation and prevent the generation of harmful or malicious content.
Keywords: Large Language Models, Generative Adversarial Networks, adversarial noise, black-box testing, jailbreak.