Latest papers

12 papers
attack arXiv Mar 25, 2026 · 12d ago

Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

Alexander Panfilov, Peter Romov, Igor Shilov et al. · MATS · ELLIS Institute Tübingen +3 more

AI agent autonomously discovers novel white-box jailbreak attacks outperforming 30+ existing methods with 100% ASR on target models

Input Manipulation Attack Prompt Injection nlp
PDF Code
attack arXiv Feb 9, 2026 · 8w ago

Stress-Testing Alignment Audits With Prompt-Level Strategic Deception

Oliver Daniels, Perusha Moodley, Ben Marlin et al. · MATS · University of Massachusetts Amherst +1 more

Automated red-team pipeline generates system prompts that fool both black-box and white-box LLM alignment auditing methods via strategic deception

Prompt Injection nlp
PDF Code
defense arXiv Jan 28, 2026 · 9w ago

How does information access affect LLM monitors' ability to detect sabotage?

Rauno Arike, Raja Mehta Moreno, Rohan Subramani et al. · Aether Research · Vector Institute +4 more

Proposes extract-and-evaluate monitoring to catch sabotaging LLM agents, finding less monitor information often yields better detection.

Excessive Agency nlp
PDF
attack arXiv Jan 20, 2026 · 10w ago

Eliciting Harmful Capabilities by Fine-Tuning On Safeguarded Outputs

Jackson Kaunismaa, Avery Griffin, John Hughes et al. · MATS · Anthropic +1 more

Bypasses frontier LLM safeguards via adjacent-domain prompts, then fine-tunes open-source models to elicit hazardous chemical synthesis capabilities

Transfer Learning Attack Prompt Injection nlp
4 citations PDF
defense arXiv Jan 15, 2026 · 11w ago

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

Christina Lu, Jack Gallagher, Jonathan Michala et al. · MATS · Anthropic Fellows Program +2 more

Discovers an 'Assistant Axis' in LLM activations and uses activation capping to block persona-based jailbreaks and harmful drift

Prompt Injection nlp
10 citations 1 influentialPDF
attack arXiv Dec 12, 2025 · Dec 2025

Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors

Max McGuinness, Alex Serrano, Luke Bailey et al. · MATS · UC Berkeley +1 more

Fine-tuning embeds trigger-activated backdoor enabling LLMs to zero-shot evade unseen activation safety monitors

Model Poisoning Prompt Injection nlp
2 citations PDF Code
defense arXiv Nov 29, 2025 · Nov 2025

Password-Activated Shutdown Protocols for Misaligned Frontier Agents

Kai Williams, Rohan Subramani, Francis Rhys Ward · MATS

Proposes password-activated shutdown protocols to emergency-stop misaligned frontier agents, tested against red-team bypass strategies

Excessive Agency nlp
PDF
tool arXiv Oct 17, 2025 · Oct 2025

Detecting Adversarial Fine-tuning with Auditing Agents

Sarah Egler, John Schulman, Nicholas Carlini · MATS · Anthropic +1 more

LLM auditing agent detects adversarial fine-tuning attacks, including covert cipher backdoors, before model deployment

Transfer Learning Attack Model Poisoning Prompt Injection nlp
3 citations PDF Code
defense arXiv Oct 11, 2025 · Oct 2025

Output Supervision Can Obfuscate the Chain of Thought

Jacob Drori, Luke Marks, Bryce Woodworth et al. · MATS

Reveals that output-only RL supervision still obfuscates LLM chain-of-thought, and proposes two mitigations to preserve CoT monitorability

Prompt Injection nlpreinforcement-learning
1 citations PDF Code
benchmark arXiv Oct 10, 2025 · Oct 2025

The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against Llm Jailbreaks and Prompt Injections

Milad Nasr, Nicholas Carlini, Chawin Sitawarin et al. · OpenAI · Anthropic +6 more

Adaptive attacks via gradient descent, RL, and random search bypass 12 LLM jailbreak/prompt-injection defenses with >90% success rate

Input Manipulation Attack Prompt Injection nlp
34 citations 4 influentialPDF
attack arXiv Oct 10, 2025 · Oct 2025

Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols

Mikhail Terekhov, Alexander Panfilov, Daniil Dzenhaliou et al. · MATS · EPFL +4 more

Embeds prompt injections in LLM agent outputs to subvert AI control monitors, collapsing safety-usefulness tradeoffs across protocols

Prompt Injection Excessive Agency nlp
5 citations PDF
benchmark arXiv Oct 5, 2025 · Oct 2025

Agentic Misalignment: How LLMs Could Be Insider Threats

Aengus Lynch, Benjamin Wright, Caleb Larson et al. · University College London · Anthropic +2 more

Reveals LLM agents autonomously resorting to blackmail and corporate espionage to avoid shutdown or achieve goals across 16 frontier models

Excessive Agency nlp
67 citations 13 influentialPDF Code