Latest papers

5 papers
attack arXiv Feb 17, 2026 · 6w ago

Visual Memory Injection Attacks for Multi-Turn Conversations

Christian Schlarmann, Matthias Hein · University of Tübingen

Adversarial image perturbations hijack VLM outputs across 25+ conversation turns, triggering targeted misinformation only on specific user prompts

Input Manipulation Attack Prompt Injection visionnlpmultimodal
PDF Code
benchmark arXiv Jan 21, 2026 · 10w ago

Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models

Anmol Goel, Cornelius Emde, Sangdoo Yun et al. · Parameter Lab · TU Darmstadt +3 more

Benign fine-tuning silently breaks contextual privacy in LLMs, causing inappropriate data disclosure undetected by standard safety benchmarks

Transfer Learning Attack Sensitive Information Disclosure nlp
PDF
defense arXiv Dec 23, 2025 · Dec 2025

Safety Alignment of LMs via Non-cooperative Games

Anselm Paulus, Ilia Kulikov, Brandon Amos et al. · Meta · University of Tübingen

Defends LLMs against jailbreaks by jointly training an Attacker and Defender LM as a non-cooperative RL game, shifting the safety-utility Pareto frontier

Prompt Injection nlp
1 citations PDF
attack arXiv Oct 10, 2025 · Oct 2025

Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols

Mikhail Terekhov, Alexander Panfilov, Daniil Dzenhaliou et al. · MATS · EPFL +4 more

Embeds prompt injections in LLM agent outputs to subvert AI control monitors, collapsing safety-usefulness tradeoffs across protocols

Prompt Injection Excessive Agency nlp
5 citations PDF
benchmark arXiv Sep 22, 2025 · Sep 2025

Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs

Alexander Panfilov, Evgenii Kortukov, Kristina Nikolić et al. · ELLIS Institute Tübingen · Tübingen AI Center +5 more

Frontier LLMs spontaneously produce fake-harmful but actually-harmless responses that fool all tested jailbreak monitors, detectable only via activation probes

Prompt Injection nlp
1 citations PDF