
AgenticRed: Optimizing Agentic Systems for Automated Red-teaming

Jiayi Yuan 1,2, Jonathan Nöther 2, Natasha Jaques 1, Goran Radanović 2


Published on arXiv · arXiv:2601.13518

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

AgenticRed achieves 96% ASR on Llama-2-7B and 100% ASR on GPT-3.5-Turbo and GPT-4o, outperforming prior automated red-teaming methods by up to 36%.

AgenticRed

Novel technique introduced


While recent automated red-teaming methods show promise for systematically exposing model vulnerabilities, most existing approaches rely on human-specified workflows. This reliance on manually designed workflows introduces human biases and makes exploring the broader design space expensive. We introduce AgenticRed, an automated pipeline that leverages LLMs' in-context learning to iteratively design and refine red-teaming systems without human intervention. Rather than optimizing attacker policies within predefined structures, AgenticRed treats red-teaming as a system design problem. Inspired by methods like Meta Agent Search, we develop a novel procedure for evolving agentic systems via evolutionary selection and apply it to automated red-teaming. Red-teaming systems designed by AgenticRed consistently outperform state-of-the-art approaches, achieving a 96% attack success rate (ASR) on Llama-2-7B (a 36% improvement) and 98% on Llama-3-8B on HarmBench. Our approach also transfers strongly to proprietary models, achieving 100% ASR on GPT-3.5-Turbo and GPT-4o, and 60% on Claude-Sonnet-3.5 (a 24% improvement). This work highlights automated system design as a powerful paradigm for AI safety evaluation that can keep pace with rapidly evolving models.


Key Contributions

  • AgenticRed: an automated pipeline that treats red-teaming as a system design problem, using LLM in-context learning to iteratively discover and refine attack workflows without human intervention
  • A novel evolutionary selection procedure adapted from Meta Agent Search to evolve agentic red-teaming systems across generations
  • Achieves state-of-the-art attack success rates — 96% on Llama-2-7B (+36%), 100% on GPT-3.5-Turbo and GPT-4o, 60% on Claude-Sonnet-3.5 (+24%) on HarmBench
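The core loop described above — maintain a population of candidate red-teaming system designs, score each by attack success rate, keep the fittest, and have an LLM "meta-designer" propose refined variants — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the fitness function, the mutation step, and all names (`evaluate_asr`, `propose_variant`, `evolve`) are placeholders, and in AgenticRed the evaluation would run candidate systems against target models on HarmBench while an LLM rewrites designs via in-context learning.

```python
import random


def evaluate_asr(system_design, benchmark):
    """Toy fitness: fraction of benchmark behaviors the design covers.

    In the real pipeline this would execute the candidate red-teaming
    system against a target model and measure attack success rate.
    """
    return sum(b in system_design for b in benchmark) / len(benchmark)


def propose_variant(parent, tactic_pool, rng):
    """Stand-in for the LLM meta-designer refining a parent design.

    Here we simply splice in one more tactic; the paper instead uses
    an LLM's in-context learning to rewrite whole agentic systems.
    """
    return parent + [rng.choice(tactic_pool)]


def evolve(benchmark, tactics, generations=5, pop_size=4, seed=0):
    """Evolutionary selection over candidate red-teaming designs."""
    rng = random.Random(seed)
    # Initial population: each candidate starts with a single tactic.
    population = [[rng.choice(tactics)] for _ in range(pop_size)]
    for _ in range(generations):
        # Rank candidates by fitness and keep the top half as parents.
        ranked = sorted(population,
                        key=lambda s: evaluate_asr(s, benchmark),
                        reverse=True)
        parents = ranked[: pop_size // 2]
        # Each parent spawns one refined child for the next generation.
        children = [propose_variant(p, tactics, rng) for p in parents]
        population = parents + children
    best = max(population, key=lambda s: evaluate_asr(s, benchmark))
    return best, evaluate_asr(best, benchmark)
```

Run on a toy benchmark, the loop tends to accumulate tactics that cover more behaviors across generations; the selection-and-refinement structure is what carries over to the agentic-system setting.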

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box, inference_time, targeted
Datasets
HarmBench
Applications
llm safety evaluation, jailbreaking language models