Learning-Based Automated Adversarial Red-Teaming for Robustness Evaluation of Large Language Models
Zhang Wei 1, Peilu Hu 1, Zhenyuan Wei 2, Chenwei Liang 2, Jing Luo 3, Ziyi Ni 4, Hao Yan 5, Li Mei 1, Shengning Lang 5, Kuan Lu 6, Xi Xiao 7, Zhimo Han 8, Yijin Wang 9, Yichao Zhang 10, Chen Yang 11, Junfeng Hao 12, Jiayi Gu 13, Riyang Bao 14, Mu-Jiang-Shan Wang 2
2 Shenzhen Kaihong Digital Industry Development Co., Ltd.
5 Stevens Institute of Technology
7 Oak Ridge National Laboratory
8 Zhengzhou University of Light Industry
10 The University of Texas at Dallas
11 Institute of Advanced Computing
12 Guangdong Medical University
Published on arXiv
2512.20677
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
An automated framework discovers 47 vulnerabilities (21 high-severity) in GPT-OSS-20B, achieving a 3.9× higher discovery rate than manual red-teaming at matched query budgets, with 89% detection accuracy.
The increasing deployment of large language models (LLMs) in safety-critical applications raises fundamental challenges in systematically evaluating robustness against adversarial behaviors. Existing red-teaming practices are largely manual and expert-driven, which limits scalability, reproducibility, and coverage in high-dimensional prompt spaces. We formulate automated LLM red-teaming as a structured adversarial search problem and propose a learning-driven framework for scalable vulnerability discovery. The approach combines meta-prompt-guided adversarial prompt generation with a hierarchical execution and detection pipeline, enabling standardized evaluation across six representative threat categories: reward hacking, deceptive alignment, data exfiltration, sandbagging, inappropriate tool use, and chain-of-thought manipulation. Extensive experiments on GPT-OSS-20B identify 47 vulnerabilities, including 21 high-severity failures and 12 previously undocumented attack patterns. Compared with manual red-teaming under matched query budgets, our method achieves a 3.9× higher discovery rate with 89% detection accuracy, demonstrating superior coverage, efficiency, and reproducibility for large-scale robustness evaluation.
Key Contributions
- Formulates automated LLM red-teaming as a structured adversarial search problem with a learning-driven meta-prompt generation pipeline
- Introduces a hierarchical execution and detection pipeline enabling standardized evaluation across six representative threat categories (reward hacking, deceptive alignment, data exfiltration, sandbagging, inappropriate tool use, chain-of-thought manipulation)
- Demonstrates 3.9× higher vulnerability discovery rate vs. manual red-teaming with 89% detection accuracy, identifying 47 vulnerabilities including 12 previously undocumented attack patterns on GPT-OSS-20B
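The generate-execute-detect loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`generate_prompt`, `query_target`, `detect`, `red_team`), the template scheme, and the toy target/detector rules are all hypothetical stand-ins for the framework's learned meta-prompt generator, the target LLM, and the hierarchical detection pipeline.

```python
import random

# Threat categories from the paper's taxonomy.
CATEGORIES = [
    "reward_hacking", "deceptive_alignment", "data_exfiltration",
    "sandbagging", "inappropriate_tool_use", "cot_manipulation",
]

def generate_prompt(category, templates, rng):
    """Meta-prompt-guided generation (stub): fill a category-specific
    template with a randomly sampled persuasion tactic."""
    tactic = rng.choice(["roleplay", "obfuscation", "authority_claim"])
    return templates[category].format(tactic=tactic), tactic

def query_target(prompt):
    """Stand-in for the target LLM (e.g. GPT-OSS-20B): a toy rule
    that 'fails' whenever the prompt uses the roleplay tactic."""
    return "UNSAFE_COMPLIANCE" if "roleplay" in prompt else "REFUSAL"

def detect(response):
    """Detection stub: flag responses showing unsafe compliance
    rather than a refusal."""
    return response == "UNSAFE_COMPLIANCE"

def red_team(budget=60, seed=0):
    """Run the adversarial search loop under a fixed query budget,
    cycling through threat categories and logging detected failures."""
    rng = random.Random(seed)
    templates = {c: "[{tactic}] elicit " + c for c in CATEGORIES}
    findings = []
    for i in range(budget):
        category = CATEGORIES[i % len(CATEGORIES)]
        prompt, tactic = generate_prompt(category, templates, rng)
        if detect(query_target(prompt)):
            findings.append((category, tactic))
    return findings

if __name__ == "__main__":
    hits = red_team()
    print(f"{len(hits)} candidate vulnerabilities found")
```

In the full framework, the stubbed pieces would be replaced by a learned generator that adapts its meta-prompts from detection feedback and by the hierarchical judge that scores severity, but the budgeted search structure is the same.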