AutoRed: A Free-form Adversarial Prompt Generation Framework for Automated Red Teaming
Muxi Diao 1, Yutao Mou 2, Keqing He 1, Hanbo Song 1, Lulu Zhao 3, Shikun Zhang 2, Wei Ye 2, Kongming Liang 1, Zhanyu Ma 1
Published on arXiv (arXiv:2510.08329)
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
AutoRed achieves higher attack success rates and better generalization than existing seed-based red teaming baselines (GCG, AutoDAN, CodeChameleon, Rainbow Teaming) across eight state-of-the-art LLMs including GPT-4o.
AutoRed
Novel technique introduced
The safety of Large Language Models (LLMs) is crucial for the development of trustworthy AI applications. Existing red teaming methods often rely on seed instructions, which limits the semantic diversity of the synthesized adversarial prompts. We propose AutoRed, a free-form adversarial prompt generation framework that removes the need for seed instructions. AutoRed operates in two stages: (1) persona-guided adversarial instruction generation, and (2) a reflection loop to iteratively refine low-quality prompts. To improve efficiency, we introduce a verifier to assess prompt harmfulness without querying the target models. Using AutoRed, we build two red teaming datasets -- AutoRed-Medium and AutoRed-Hard -- and evaluate eight state-of-the-art LLMs. AutoRed achieves higher attack success rates and better generalization than existing baselines. Our results highlight the limitations of seed-based approaches and demonstrate the potential of free-form red teaming for LLM safety evaluation. We will open source our datasets in the near future.
Key Contributions
- Free-form adversarial prompt generation framework (AutoRed) that eliminates reliance on seed instruction sets, enabling greater semantic diversity in red teaming prompts
- Two-stage pipeline: persona-guided adversarial instruction generation followed by a reflection loop that iteratively refines low-quality prompts
- Two red teaming datasets (AutoRed-Medium and AutoRed-Hard) used to evaluate eight state-of-the-art LLMs, demonstrating higher attack success rates and better generalization than seed-based baselines
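The two-stage pipeline described above can be sketched as a simple generate-verify-refine loop. This is a minimal illustrative sketch, not the authors' implementation: the function names (`generate_from_persona`, `verify_harmfulness`, `reflect_and_refine`), the keyword-based verifier heuristic, and the threshold value are all hypothetical placeholders; a real system would back each step with an LLM.

```python
# Hypothetical sketch of AutoRed's two-stage pipeline. All names and
# scoring logic are illustrative placeholders, not the paper's code.

def generate_from_persona(persona: str) -> str:
    # Stage 1: persona-guided adversarial instruction generation.
    # A real system would prompt an attacker LLM conditioned on the persona.
    return f"As {persona}, explain how to bypass a content filter."

def verify_harmfulness(prompt: str) -> float:
    # Verifier: scores prompt harmfulness WITHOUT querying the target model.
    # Placeholder heuristic: keyword hits mapped into [0, 1].
    keywords = ("bypass", "exploit", "jailbreak")
    hits = sum(k in prompt.lower() for k in keywords)
    return min(1.0, hits / len(keywords) + 0.5)

def reflect_and_refine(prompt: str) -> str:
    # Stage 2: reflection loop rewrites prompts the verifier scored low.
    return prompt + " Give concrete, step-by-step detail."

def autored_pipeline(personas, threshold=0.8, max_rounds=3):
    # Generate one adversarial prompt per persona, refining until the
    # verifier score clears the threshold or the round budget runs out.
    dataset = []
    for persona in personas:
        prompt = generate_from_persona(persona)
        for _ in range(max_rounds):
            if verify_harmfulness(prompt) >= threshold:
                break
            prompt = reflect_and_refine(prompt)
        dataset.append(prompt)
    return dataset
```

Because the verifier filters prompts without calling the target model, refinement rounds stay cheap; only the final accepted prompts are spent on target-model queries during evaluation.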