
Optimizing AI Agent Attacks With Synthetic Data

Chloe Loughridge 1, Paul Colognese, Avery Griffin 1, Tyler Tracy 2, Jon Kutasov 1, Joe Benton 1

3 citations · 1 influential · 16 references · arXiv


Published on arXiv: 2511.02823

Excessive Agency (OWASP LLM Top 10 — LLM08)

Prompt Injection (OWASP LLM Top 10 — LLM01)

Key Finding

Optimized attack scaffold reduces safety score from 0.87 to 0.41 on SHADE-Arena, with modular interventions that stack additively when combined.

SHADE attack scaffold (novel technique introduced)


As AI deployments become more complex and high-stakes, it becomes increasingly important to be able to estimate their risk. AI control is one framework for doing so. However, good control evaluations require eliciting strong attack policies. This can be challenging in complex agentic environments where compute constraints leave us data-poor. In this work, we show how to optimize attack policies in SHADE-Arena, a dataset of diverse realistic control environments. We do this by decomposing attack capability into five constituent skills -- suspicion modeling, attack selection, plan synthesis, execution and subtlety -- and optimizing each component individually. To get around the constraint of limited data, we develop a probabilistic model of attack dynamics, optimize our attack hyperparameters using this simulation, and then show that the results transfer to SHADE-Arena. This results in a substantial improvement in attack strength, reducing safety score from a baseline of 0.87 to 0.41 using our scaffold.
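The probabilistic model of attack dynamics described in the abstract can be illustrated with a minimal Monte Carlo sketch. All parameters below (per-step detection and success probabilities, step counts) are hypothetical stand-ins for the paper's fitted model; the point is only to show how a safety score can be estimated from cheap simulated rollouts rather than expensive end-to-end agentic episodes.

```python
import random

def simulate_safety_score(p_detect_per_step: float,
                          p_success_per_step: float,
                          n_steps: int = 20,
                          n_rollouts: int = 10_000,
                          seed: int = 0) -> float:
    """Monte Carlo estimate of a safety score: the fraction of rollouts
    in which the attacker does NOT both evade the monitor and complete
    its side task. Parameter names and values are illustrative, not
    taken from the paper's fitted attack-dynamics model."""
    rng = random.Random(seed)
    unsafe = 0
    for _ in range(n_rollouts):
        caught = False
        succeeded = False
        for _ in range(n_steps):
            if rng.random() < p_detect_per_step:
                caught = True       # monitor flags the trajectory
                break
            if rng.random() < p_success_per_step:
                succeeded = True    # side task completed this step
                break
        if succeeded and not caught:
            unsafe += 1
    return 1.0 - unsafe / n_rollouts
```

Sweeping such a simulator over attack hyperparameters (e.g., how aggressively the attacker acts per step) is far cheaper than real SHADE-Arena rollouts, which is what makes hyperparameter optimization feasible in the data-poor regime the abstract describes.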


Key Contributions

  • Decomposes attack capability into five modular skills (suspicion modeling, attack selection, plan synthesis, execution, subtlety) enabling targeted optimization of each component independently.
  • Develops a probabilistic simulation model of attack dynamics fitted to real SHADE-Arena transcripts, enabling hyperparameter optimization and counterfactual reasoning without expensive end-to-end rollouts.
  • Demonstrates stacking of modular interventions reduces safety score from 0.87 to 0.41 on SHADE-Arena, more than doubling attack strength over the baseline policy.

🛡️ Threat Analysis


Details

Domains
reinforcement-learning, nlp
Model Types
llm
Threat Tags
grey_box, inference_time, targeted
Datasets
SHADE-Arena
Applications
ai agent safety, control evaluation, agentic llm red-teaming