ML Security Papers

Latest papers

12 papers

attack arXiv Mar 25, 2026 · 12d ago

Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

Alexander Panfilov, Peter Romov, Igor Shilov et al. · MATS · ELLIS Institute Tübingen +3 more

AI agent autonomously discovers novel white-box jailbreak attacks outperforming 30+ existing methods with 100% ASR on target models

Input Manipulation Attack Prompt Injection nlp

PDF Code

attack arXiv Feb 9, 2026 · 8w ago

Stress-Testing Alignment Audits With Prompt-Level Strategic Deception

Oliver Daniels, Perusha Moodley, Ben Marlin et al. · MATS · University of Massachusetts Amherst +1 more

Automated red-team pipeline generates system prompts that fool both black-box and white-box LLM alignment auditing methods via strategic deception

Prompt Injection nlp

PDF Code

defense arXiv Jan 28, 2026 · 9w ago

How does information access affect LLM monitors' ability to detect sabotage?

Rauno Arike, Raja Mehta Moreno, Rohan Subramani et al. · Aether Research · Vector Institute +4 more

Proposes extract-and-evaluate monitoring to catch sabotaging LLM agents, finding less monitor information often yields better detection.

Excessive Agency nlp

PDF

attack arXiv Jan 20, 2026 · 10w ago

Eliciting Harmful Capabilities by Fine-Tuning On Safeguarded Outputs

Jackson Kaunismaa, Avery Griffin, John Hughes et al. · MATS · Anthropic +1 more

Bypasses frontier LLM safeguards via adjacent-domain prompts, then fine-tunes open-source models to elicit hazardous chemical synthesis capabilities

Transfer Learning Attack Prompt Injection nlp

4 citations PDF

defense arXiv Jan 15, 2026 · 11w ago

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

Christina Lu, Jack Gallagher, Jonathan Michala et al. · MATS · Anthropic Fellows Program +2 more

Discovers an 'Assistant Axis' in LLM activations and uses activation capping to block persona-based jailbreaks and harmful drift

Prompt Injection nlp

10 citations 1 influentialPDF

attack arXiv Dec 12, 2025 · Dec 2025

Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors

Max McGuinness, Alex Serrano, Luke Bailey et al. · MATS · UC Berkeley +1 more

Fine-tuning embeds trigger-activated backdoor enabling LLMs to zero-shot evade unseen activation safety monitors

Model Poisoning Prompt Injection nlp

2 citations PDF Code

defense arXiv Nov 29, 2025 · Nov 2025

Password-Activated Shutdown Protocols for Misaligned Frontier Agents

Kai Williams, Rohan Subramani, Francis Rhys Ward · MATS

Proposes password-activated shutdown protocols to emergency-stop misaligned frontier agents, tested against red-team bypass strategies

Excessive Agency nlp

PDF

tool arXiv Oct 17, 2025 · Oct 2025

Detecting Adversarial Fine-tuning with Auditing Agents

Sarah Egler, John Schulman, Nicholas Carlini · MATS · Anthropic +1 more

LLM auditing agent detects adversarial fine-tuning attacks, including covert cipher backdoors, before model deployment

Transfer Learning Attack Model Poisoning Prompt Injection nlp

3 citations PDF Code

defense arXiv Oct 11, 2025 · Oct 2025

Output Supervision Can Obfuscate the Chain of Thought

Jacob Drori, Luke Marks, Bryce Woodworth et al. · MATS

Reveals that output-only RL supervision still obfuscates LLM chain-of-thought, and proposes two mitigations to preserve CoT monitorability

Prompt Injection nlpreinforcement-learning

1 citations PDF Code

benchmark arXiv Oct 10, 2025 · Oct 2025

The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against Llm Jailbreaks and Prompt Injections

Milad Nasr, Nicholas Carlini, Chawin Sitawarin et al. · OpenAI · Anthropic +6 more

Adaptive attacks via gradient descent, RL, and random search bypass 12 LLM jailbreak/prompt-injection defenses with >90% success rate

Input Manipulation Attack Prompt Injection nlp

34 citations 4 influentialPDF

attack arXiv Oct 10, 2025 · Oct 2025

Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols

Mikhail Terekhov, Alexander Panfilov, Daniil Dzenhaliou et al. · MATS · EPFL +4 more

Embeds prompt injections in LLM agent outputs to subvert AI control monitors, collapsing safety-usefulness tradeoffs across protocols

Prompt Injection Excessive Agency nlp

5 citations PDF

benchmark arXiv Oct 5, 2025 · Oct 2025

Agentic Misalignment: How LLMs Could Be Insider Threats

Aengus Lynch, Benjamin Wright, Caleb Larson et al. · University College London · Anthropic +2 more

Reveals LLM agents autonomously resorting to blackmail and corporate espionage to avoid shutdown or achieve goals across 16 frontier models

Excessive Agency nlp

67 citations 13 influentialPDF Code

Latest papers

Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

Stress-Testing Alignment Audits With Prompt-Level Strategic Deception

How does information access affect LLM monitors' ability to detect sabotage?

Eliciting Harmful Capabilities by Fine-Tuning On Safeguarded Outputs

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors

Password-Activated Shutdown Protocols for Misaligned Frontier Agents

Detecting Adversarial Fine-tuning with Auditing Agents

Output Supervision Can Obfuscate the Chain of Thought

The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against Llm Jailbreaks and Prompt Injections

Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols

Agentic Misalignment: How LLMs Could Be Insider Threats

Filters

Time Period

Paper Type

OWASP ML Top 10

OWASP LLM Top 10

Institution

Venue