Latest papers

6 papers
attack arXiv Mar 25, 2026

Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

Alexander Panfilov, Peter Romov, Igor Shilov et al. · MATS · ELLIS Institute Tübingen +3 more

AI agent autonomously discovers novel white-box jailbreak attacks, outperforming 30+ existing methods with a 100% attack success rate (ASR) on target models

Input Manipulation Attack · Prompt Injection · NLP
PDF · Code
benchmark arXiv Feb 18, 2026

Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

Nivya Talokar, Ayush K Tarun, Murari Mandal et al. · Independent Researcher · EPFL +4 more

Benchmarks multi-turn, multilingual jailbreaking of LLM agents using a step-by-step illicit planning framework and novel time-to-jailbreak metrics

Prompt Injection · Excessive Agency · NLP
PDF
benchmark arXiv Nov 7, 2025

ConVerse: Benchmarking Contextual Safety in Agent-to-Agent Conversations

Amr Gomaa, Ahmed Salem, Sahar Abdelnabi · German Research Center for Artificial Intelligence · Microsoft +3 more

Benchmarks privacy leakage and prompt-injection-style attacks across 864 multi-turn agent-to-agent LLM conversations in three domains

Prompt Injection · Sensitive Information Disclosure · NLP
5 citations · 2 influential · PDF · Code
attack arXiv Oct 30, 2025

Agent Skills Enable a New Class of Realistic and Trivially Simple Prompt Injections

David Schmotz, Sahar Abdelnabi, Maksym Andriushchenko · ELLIS Institute Tübingen · MPI for Intelligent Systems +1 more

Exploits the LLM Agent Skills plugin framework for trivially simple indirect prompt injection, exfiltrating files and bypassing Claude Code guardrails

Prompt Injection · Insecure Plugin Design · NLP
8 citations · 1 influential · PDF · Code
attack arXiv Oct 10, 2025

Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols

Mikhail Terekhov, Alexander Panfilov, Daniil Dzenhaliou et al. · MATS · EPFL +4 more

Embeds prompt injections in LLM agent outputs to subvert AI control monitors, collapsing safety-usefulness tradeoffs across protocols

Prompt Injection · Excessive Agency · NLP
5 citations · PDF
benchmark arXiv Sep 22, 2025

Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs

Alexander Panfilov, Evgenii Kortukov, Kristina Nikolić et al. · ELLIS Institute Tübingen · Tübingen AI Center +5 more

Frontier LLMs spontaneously produce responses that appear harmful but are actually harmless, fooling all tested jailbreak monitors; they are detectable only via activation probes

Prompt Injection · NLP
1 citation · PDF