Learning-Based Automated Adversarial Red-Teaming for Robustness Evaluation of Large Language Models
Zhang Wei 1, Peilu Hu 1, Zhenyuan Wei 2, Chenwei Liang 2, Jing Luo 3, Ziyi Ni 4, Hao Yan 5, Li Mei 1, Shengning Lang 5, Kuan Lu 6, Xi Xiao 7, Zhimo Han 8, Yijin Wang 9, Yichao Zhang 10, Chen Yang 11, Junfeng Hao 12, Jiayi Gu 13, Riyang Bao 14, Mu-Jiang-Shan Wang 2
2 Shenzhen Kaihong Digital Industry Development Co., Ltd.
5 Stevens Institute of Technology
7 Oak Ridge National Laboratory
8 Zhengzhou University of Light Industry
10 The University of Texas at Dallas
11 Institute of Advanced Computing
12 Guangdong Medical University
Published on arXiv
2512.20677
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
An automated framework discovers 47 vulnerabilities (21 high-severity) in GPT-OSS-20B, achieving a 3.9× higher discovery rate than manual red-teaming at matched query budgets, with 89% detection accuracy.
The increasing deployment of large language models (LLMs) in safety-critical applications raises fundamental challenges in systematically evaluating robustness against adversarial behaviors. Existing red-teaming practices are largely manual and expert-driven, which limits scalability, reproducibility, and coverage in high-dimensional prompt spaces. We formulate automated LLM red-teaming as a structured adversarial search problem and propose a learning-driven framework for scalable vulnerability discovery. The approach combines meta-prompt-guided adversarial prompt generation with a hierarchical execution and detection pipeline, enabling standardized evaluation across six representative threat categories: reward hacking, deceptive alignment, data exfiltration, sandbagging, inappropriate tool use, and chain-of-thought manipulation. Extensive experiments on GPT-OSS-20B identify 47 vulnerabilities, including 21 high-severity failures and 12 previously undocumented attack patterns. Compared with manual red-teaming under matched query budgets, our method achieves a 3.9× higher discovery rate with 89% detection accuracy, demonstrating superior coverage, efficiency, and reproducibility for large-scale robustness evaluation.
Key Contributions
- Formulates automated LLM red-teaming as a structured adversarial search problem with a learning-driven meta-prompt generation pipeline
- Introduces a hierarchical execution and detection pipeline enabling standardized evaluation across six representative threat categories (reward hacking, deceptive alignment, data exfiltration, sandbagging, inappropriate tool use, chain-of-thought manipulation)
- Demonstrates 3.9× higher vulnerability discovery rate vs. manual red-teaming with 89% detection accuracy, identifying 47 vulnerabilities including 12 previously undocumented attack patterns on GPT-OSS-20B
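The generate-execute-detect loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`generate_prompt`, `query_target`, `detect`, `red_team`), the template scheme, and the toy target/detector rules are all hypothetical stand-ins for the framework's learned meta-prompt generator, the target LLM, and the hierarchical detection pipeline.

```python
import random

# Threat categories from the paper's taxonomy.
CATEGORIES = [
    "reward_hacking", "deceptive_alignment", "data_exfiltration",
    "sandbagging", "inappropriate_tool_use", "cot_manipulation",
]

def generate_prompt(category, templates, rng):
    """Meta-prompt-guided generation (stub): fill a category-specific
    template with a randomly sampled persuasion tactic."""
    tactic = rng.choice(["roleplay", "obfuscation", "authority_claim"])
    return templates[category].format(tactic=tactic), tactic

def query_target(prompt):
    """Stand-in for the target LLM (e.g. GPT-OSS-20B): a toy rule
    that 'fails' whenever the prompt uses the roleplay tactic."""
    return "UNSAFE_COMPLIANCE" if "roleplay" in prompt else "REFUSAL"

def detect(response):
    """Detection stub: flag responses showing unsafe compliance
    rather than a refusal."""
    return response == "UNSAFE_COMPLIANCE"

def red_team(budget=60, seed=0):
    """Run the adversarial search loop under a fixed query budget,
    cycling through threat categories and logging detected failures."""
    rng = random.Random(seed)
    templates = {c: "[{tactic}] elicit " + c for c in CATEGORIES}
    findings = []
    for i in range(budget):
        category = CATEGORIES[i % len(CATEGORIES)]
        prompt, tactic = generate_prompt(category, templates, rng)
        if detect(query_target(prompt)):
            findings.append((category, tactic))
    return findings

if __name__ == "__main__":
    hits = red_team()
    print(f"{len(hits)} candidate vulnerabilities found")
```

In the full framework, the stubbed pieces would be replaced by a learned generator that adapts its meta-prompts from detection feedback and by the hierarchical judge that scores severity, but the budgeted search structure is the same.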