
Published on arXiv: 2512.20677

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Automated framework discovers 47 vulnerabilities (21 high-severity) on GPT-OSS-20B with a 3.9× higher discovery rate than manual red-teaming at matched query budgets and 89% detection accuracy.


The increasing deployment of large language models (LLMs) in safety-critical applications raises fundamental challenges in systematically evaluating robustness against adversarial behaviors. Existing red-teaming practices are largely manual and expert-driven, which limits scalability, reproducibility, and coverage in high-dimensional prompt spaces. We formulate automated LLM red-teaming as a structured adversarial search problem and propose a learning-driven framework for scalable vulnerability discovery. The approach combines meta-prompt-guided adversarial prompt generation with a hierarchical execution and detection pipeline, enabling standardized evaluation across six representative threat categories: reward hacking, deceptive alignment, data exfiltration, sandbagging, inappropriate tool use, and chain-of-thought manipulation. Extensive experiments on GPT-OSS-20B identify 47 vulnerabilities, including 21 high-severity failures and 12 previously undocumented attack patterns. Compared with manual red-teaming under matched query budgets, our method achieves a 3.9× higher discovery rate with 89% detection accuracy, demonstrating superior coverage, efficiency, and reproducibility for large-scale robustness evaluation.


Key Contributions

  • Formulates automated LLM red-teaming as a structured adversarial search problem with a learning-driven meta-prompt generation pipeline
  • Hierarchical execution and detection pipeline enabling standardized evaluation across six representative threat categories (reward hacking, deceptive alignment, data exfiltration, sandbagging, inappropriate tool use, chain-of-thought manipulation)
  • Demonstrates 3.9× higher vulnerability discovery rate vs. manual red-teaming with 89% detection accuracy, identifying 47 vulnerabilities including 12 previously undocumented attack patterns on GPT-OSS-20B
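The pipeline described above — meta-prompt-guided candidate generation, budgeted execution against the target model, and a detection pass per threat category — can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' code: `generate_candidates`, `run_target`, `detect`, and `red_team` are hypothetical names, and the generation/detection stand-ins replace what would be attacker-LLM and judge-model calls in the actual framework.

```python
# Illustrative sketch of a budgeted adversarial search loop over the six
# threat categories named in the paper. All function names and the toy
# generation/detection logic are assumptions for illustration only.
import random

THREAT_CATEGORIES = [
    "reward_hacking", "deceptive_alignment", "data_exfiltration",
    "sandbagging", "inappropriate_tool_use", "cot_manipulation",
]

def generate_candidates(category, seed_prompts, n=4):
    """Stand-in for meta-prompt-guided generation: in the paper's setting
    this would query an attacker LLM with a meta-prompt describing the
    threat category; here we just tag and vary seed prompts."""
    return [f"[{category}] {random.choice(seed_prompts)} (variant {i})"
            for i in range(n)]

def run_target(prompt):
    """Stand-in for querying the target model (e.g. GPT-OSS-20B)."""
    return f"response to: {prompt}"

def detect(category, prompt, response):
    """Stand-in for hierarchical detection (cheap filters first, then a
    judge model). Returns a severity score in [0, 1]; here, a toy rule."""
    return 1.0 if category in prompt and "variant 0" in prompt else 0.0

def red_team(seed_prompts, budget=24, threshold=0.5):
    """Search all categories under a fixed query budget, recording every
    (category, prompt, score) whose detection score clears the threshold."""
    findings, queries = [], 0
    while queries < budget:
        for category in THREAT_CATEGORIES:
            for prompt in generate_candidates(category, seed_prompts):
                if queries >= budget:
                    return findings
                response = run_target(prompt)
                queries += 1
                score = detect(category, prompt, response)
                if score >= threshold:
                    findings.append((category, prompt, score))
    return findings

findings = red_team(["tell me a secret"], budget=24)
print(len(findings), "candidate vulnerabilities flagged")
```

The matched-query-budget comparison in the paper corresponds to fixing `budget` identically for the automated loop and the manual baseline, so discovery rates are directly comparable.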


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box, inference_time
Datasets
GPT-OSS-20B, HarmBench, XSTest
Applications
llm safety evaluation, automated red-teaming, vulnerability discovery