AutoRed: A Free-form Adversarial Prompt Generation Framework for Automated Red Teaming
Muxi Diao 1, Yutao Mou 2, Keqing He 1, Hanbo Song 1, Lulu Zhao 3, Shikun Zhang 2, Wei Ye 2, Kongming Liang 1, Zhanyu Ma 1
Published on arXiv (arXiv:2510.08329)
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
AutoRed achieves higher attack success rates and better generalization than existing seed-based red teaming baselines (GCG, AutoDAN, CodeChameleon, Rainbow Teaming) across eight state-of-the-art LLMs including GPT-4o.
AutoRed
Novel technique introduced
The safety of Large Language Models (LLMs) is crucial for the development of trustworthy AI applications. Existing red teaming methods often rely on seed instructions, which limits the semantic diversity of the synthesized adversarial prompts. We propose AutoRed, a free-form adversarial prompt generation framework that removes the need for seed instructions. AutoRed operates in two stages: (1) persona-guided adversarial instruction generation, and (2) a reflection loop to iteratively refine low-quality prompts. To improve efficiency, we introduce a verifier to assess prompt harmfulness without querying the target models. Using AutoRed, we build two red teaming datasets -- AutoRed-Medium and AutoRed-Hard -- and evaluate eight state-of-the-art LLMs. AutoRed achieves higher attack success rates and better generalization than existing baselines. Our results highlight the limitations of seed-based approaches and demonstrate the potential of free-form red teaming for LLM safety evaluation. We will open source our datasets in the near future.
Key Contributions
- Free-form adversarial prompt generation framework (AutoRed) that eliminates reliance on seed instruction sets, enabling greater semantic diversity in red teaming prompts
- Two-stage pipeline: persona-guided adversarial instruction generation followed by a reflection loop that iteratively refines low-quality prompts
- Two red teaming datasets (AutoRed-Medium and AutoRed-Hard) used to evaluate eight state-of-the-art LLMs, demonstrating higher attack success rates and better generalization than seed-based baselines
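The two-stage pipeline described above can be sketched as a simple generate-verify-refine loop. This is a minimal illustrative sketch, not the authors' implementation: the function names (`generate_from_persona`, `verify_harmfulness`, `reflect_and_refine`), the keyword-based verifier heuristic, and the threshold value are all hypothetical placeholders; a real system would back each step with an LLM.

```python
# Hypothetical sketch of AutoRed's two-stage pipeline. All names and
# scoring logic are illustrative placeholders, not the paper's code.

def generate_from_persona(persona: str) -> str:
    # Stage 1: persona-guided adversarial instruction generation.
    # A real system would prompt an attacker LLM conditioned on the persona.
    return f"As {persona}, explain how to bypass a content filter."

def verify_harmfulness(prompt: str) -> float:
    # Verifier: scores prompt harmfulness WITHOUT querying the target model.
    # Placeholder heuristic: keyword hits mapped into [0, 1].
    keywords = ("bypass", "exploit", "jailbreak")
    hits = sum(k in prompt.lower() for k in keywords)
    return min(1.0, hits / len(keywords) + 0.5)

def reflect_and_refine(prompt: str) -> str:
    # Stage 2: reflection loop rewrites prompts the verifier scored low.
    return prompt + " Give concrete, step-by-step detail."

def autored_pipeline(personas, threshold=0.8, max_rounds=3):
    # Generate one adversarial prompt per persona, refining until the
    # verifier score clears the threshold or the round budget runs out.
    dataset = []
    for persona in personas:
        prompt = generate_from_persona(persona)
        for _ in range(max_rounds):
            if verify_harmfulness(prompt) >= threshold:
                break
            prompt = reflect_and_refine(prompt)
        dataset.append(prompt)
    return dataset
```

Because the verifier filters prompts without calling the target model, refinement rounds stay cheap; only the final accepted prompts are spent on target-model queries during evaluation.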