Recursive language models for jailbreak detection: a procedural defense for tool-augmented agents
Published on arXiv
2602.16520
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Achieves 92.5–98.0% recall and 98.99–100% precision across three LLM backends on AutoDAN-style jailbreak inputs with 0.0–2.0% false positive rates.
RLM-JB
Novel technique introduced
Jailbreak prompts are a practical and evolving threat to large language models (LLMs), particularly in agentic systems that execute tools over untrusted content. Many attacks exploit long-context hiding, semantic camouflage, and lightweight obfuscations that can evade single-pass guardrails. We present RLM-JB, an end-to-end jailbreak detection framework built on Recursive Language Models (RLMs), in which a root model orchestrates a bounded analysis program that transforms the input, queries worker models over covered segments, and aggregates evidence into an auditable decision. RLM-JB treats detection as a procedure rather than a one-shot classification: it normalizes and de-obfuscates suspicious inputs, chunks text to reduce context dilution and guarantee coverage, performs parallel chunk screening, and composes cross-chunk signals to recover split-payload attacks. On AutoDAN-style adversarial inputs, RLM-JB achieves high detection effectiveness across three LLM backends (ASR/Recall 92.5-98.0%) while maintaining very high precision (98.99-100%) and low false positive rates (0.0-2.0%), highlighting a practical sensitivity-specificity trade-off as the screening backend changes.
Key Contributions
- RLM-based detection architecture that treats jailbreak detection as a bounded analysis program with a root model orchestrating de-obfuscation, chunking, and multi-worker screening
- Coverage-guaranteeing chunking with overlapping segments that recovers split-payload attacks by aggregating cross-chunk signals into a single compositional verdict
- Empirical evaluation across three LLM screening backends showing 92.5–98.0% recall with 98.99–100% precision and 0.0–2.0% FPR on AutoDAN-style adversarial inputs