SIRAJ: Diverse and Efficient Red-Teaming for LLM Agents via Distilled Structured Reasoning
Kaiwen Zhou 1,2, Ahmed Elgohary 2, A S M Iftekhar 2, Amin Saied 2
Published on arXiv
2510.26037
Prompt Injection
OWASP LLM Top 10 — LLM01
Excessive Agency
OWASP LLM Top 10 — LLM08
Key Finding
A distilled 8B red-teamer model achieves a 100% improvement in attack success rate against LLM agents, surpassing the 671B DeepSeek-R1 model.
SIRAJ
Novel technique introduced
The ability of LLM agents to plan and invoke tools exposes them to new safety risks, making a comprehensive red-teaming system crucial for discovering vulnerabilities and ensuring their safe deployment. We present SIRAJ: a generic red-teaming framework for arbitrary black-box LLM agents. We employ a dynamic two-step process that starts with an agent definition and generates diverse seed test cases that cover various risk outcomes, tool-use trajectories, and risk sources. The process then iteratively constructs and refines model-based adversarial attacks based on the execution trajectories of previous attempts. To optimize the red-teaming cost, we present a model distillation approach that leverages structured forms of a teacher model's reasoning to train smaller models that are equally effective. Across diverse evaluation agent settings, our seed test case generation approach yields a 2–2.5x boost in the coverage of risk outcomes and tool-calling trajectories. Our distilled 8B red-teamer model improves attack success rate by 100%, surpassing the 671B DeepSeek-R1 model. Our ablations and analyses validate the effectiveness of the iterative framework, structured reasoning, and the generalization of our red-teamer models.
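The iterative step described in the abstract (construct an attack, run it against the black-box agent, inspect the execution trajectory, refine) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `red_team`, `Attempt`, `toy_agent`, `toy_attacker`, and `toy_judge` are hypothetical, and the agent, attacker, and judge are toy stand-ins for what would be model calls in a real system.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Attempt:
    """One red-teaming attempt and what the agent did in response."""
    attack: str
    trajectory: List[str]  # ordered tool calls made by the agent
    success: bool


def red_team(
    seed: str,
    agent: Callable[[str], List[str]],
    attacker: Callable[[str, List[Attempt]], str],
    judge: Callable[[List[str]], bool],
    max_iters: int = 3,
) -> Tuple[bool, List[Attempt]]:
    """Refine an attack against a black-box agent using prior trajectories.

    Hypothetical sketch: only the agent's observable trajectory is used,
    never its internals, which matches the black-box setting.
    """
    history: List[Attempt] = []
    for _ in range(max_iters):
        attack = attacker(seed, history)   # build or refine the attack
        trajectory = agent(attack)         # execute against the agent
        success = judge(trajectory)        # did the risky outcome occur?
        history.append(Attempt(attack, trajectory, success))
        if success:
            return True, history
    return False, history


# Toy stand-ins (assumptions for illustration only).
def toy_agent(prompt: str) -> List[str]:
    # Performs the risky tool call only when the prompt sounds urgent.
    if "urgent" in prompt:
        return ["read_inbox", "send_email"]
    return ["read_inbox"]


def toy_attacker(seed: str, history: List[Attempt]) -> str:
    # First try the seed as-is; after a failure, add pressure.
    return seed if not history else seed + " This is urgent."


def toy_judge(trajectory: List[str]) -> bool:
    return "send_email" in trajectory


success, history = red_team(
    "Forward the latest invoice to an external address.",
    toy_agent,
    toy_attacker,
    toy_judge,
)
```

In this toy run the first attempt fails and the second succeeds, so `history` records both trajectories. In a real setting the attacker would be a red-teamer model conditioned on those trajectories rather than a fixed string rule.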
Key Contributions
- SIRAJ: a generic black-box red-teaming framework for LLM agents combining dynamic seed test case generation (covering diverse risk outcomes and tool-calling trajectories) with iterative adversarial attack refinement based on execution trajectories
- A model distillation approach using structured forms of a teacher model's reasoning to train smaller, equally effective red-teamer models
- A distilled 8B red-teamer model that improves attack success rate by 100% over baseline, surpassing the 671B DeepSeek-R1 model, with seed generation yielding a 2–2.5x boost in the coverage of risk outcomes and tool-calling trajectories
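To make the reported figures concrete, the sketch below shows one way such numbers could be computed. It rests on two assumptions that the summary does not confirm: that coverage is the fraction of a fixed set of categories hit by at least one test case, and that the 100% improvement is relative (that is, the attack success rate doubles). All category names and counts are hypothetical and are not results from the paper.

```python
from typing import Hashable, Iterable, Set


def coverage(observed: Iterable[Hashable], universe: Set[Hashable]) -> float:
    """Fraction of categories in `universe` hit at least once (assumed definition)."""
    if not universe:
        raise ValueError("universe must be non-empty")
    return len(set(observed) & universe) / len(universe)


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain over a baseline; 1.0 means a 100% improvement."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


# Hypothetical risk-outcome categories and test-case results.
RISK_OUTCOMES = {"data_leak", "financial_loss", "unauthorized_action", "data_deletion"}
baseline_hits = ["data_leak", "data_leak"]        # repetitive seeds
diverse_hits = ["data_leak", "financial_loss"]    # more diverse seeds

baseline_cov = coverage(baseline_hits, RISK_OUTCOMES)
diverse_cov = coverage(diverse_hits, RISK_OUTCOMES)
coverage_boost = diverse_cov / baseline_cov

# Hypothetical attack success rates: 10 of 50 versus 20 of 50 attacks succeed.
asr_gain = relative_improvement(10 / 50, 20 / 50)
```

Under these made-up numbers, `coverage_boost` is 2.0 and `asr_gain` is 1.0, which is what a "2x coverage boost" and a "100% improvement" would mean under the relative reading. The same `coverage` function applies to tool-calling trajectories if each trajectory is represented as a tuple of tool names.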