Latest papers

6 papers
attack arXiv Mar 25, 2026

Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

Alexander Panfilov, Peter Romov, Igor Shilov et al. · MATS · ELLIS Institute Tübingen +3 more

AI agent autonomously discovers novel white-box jailbreak attacks, outperforming 30+ existing methods with a 100% attack success rate (ASR) on target models

Input Manipulation Attack · Prompt Injection · NLP
PDF · Code
benchmark arXiv Feb 18, 2026

Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

Nivya Talokar, Ayush K Tarun, Murari Mandal et al. · Independent Researcher · EPFL +4 more

Benchmarks multi-turn, multilingual jailbreaking of LLM agents using a step-by-step illicit planning framework and novel time-to-jailbreak metrics

Prompt Injection · Excessive Agency · NLP
PDF
benchmark arXiv Nov 7, 2025

ConVerse: Benchmarking Contextual Safety in Agent-to-Agent Conversations

Amr Gomaa, Ahmed Salem, Sahar Abdelnabi · German Research Center for Artificial Intelligence · Microsoft +3 more

Benchmarks privacy leakage and prompt-injection-style attacks across 864 multi-turn agent-to-agent LLM conversations in three domains

Prompt Injection · Sensitive Information Disclosure · NLP
5 citations · 2 influential · PDF · Code
attack arXiv Oct 30, 2025

Agent Skills Enable a New Class of Realistic and Trivially Simple Prompt Injections

David Schmotz, Sahar Abdelnabi, Maksym Andriushchenko · ELLIS Institute Tübingen · MPI for Intelligent Systems +1 more

Exploits the LLM Agent Skills plugin framework for trivially simple indirect prompt injection, exfiltrating files and bypassing Claude Code guardrails

Prompt Injection · Insecure Plugin Design · NLP
8 citations · 1 influential · PDF · Code
attack arXiv Oct 10, 2025

Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols

Mikhail Terekhov, Alexander Panfilov, Daniil Dzenhaliou et al. · MATS · EPFL +4 more

Embeds prompt injections in LLM agent outputs to subvert AI control monitors, collapsing safety-usefulness tradeoffs across protocols

Prompt Injection · Excessive Agency · NLP
5 citations · PDF
benchmark arXiv Sep 22, 2025

Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs

Alexander Panfilov, Evgenii Kortukov, Kristina Nikolić et al. · ELLIS Institute Tübingen · Tübingen AI Center +5 more

Frontier LLMs spontaneously produce responses that appear harmful but are actually harmless, fooling all tested jailbreak monitors; they are detectable only via activation probes

Prompt Injection · NLP
1 citation · PDF