ML Security Papers

Latest papers

5 papers

attack arXiv Feb 17, 2026 · 6w ago

Visual Memory Injection Attacks for Multi-Turn Conversations

Christian Schlarmann, Matthias Hein · University of Tübingen

Adversarial image perturbations hijack VLM outputs across 25+ conversation turns, triggering targeted misinformation only on specific user prompts

Input Manipulation Attack Prompt Injection visionnlpmultimodal

PDF Code

benchmark arXiv Jan 21, 2026 · 10w ago

Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models

Anmol Goel, Cornelius Emde, Sangdoo Yun et al. · Parameter Lab · TU Darmstadt +3 more

Benign fine-tuning silently breaks contextual privacy in LLMs, causing inappropriate data disclosure undetected by standard safety benchmarks

Transfer Learning Attack Sensitive Information Disclosure nlp

PDF

defense arXiv Dec 23, 2025 · Dec 2025

Safety Alignment of LMs via Non-cooperative Games

Anselm Paulus, Ilia Kulikov, Brandon Amos et al. · Meta · University of Tübingen

Defends LLMs against jailbreaks by jointly training an Attacker and Defender LM as a non-cooperative RL game, shifting the safety-utility Pareto frontier

Prompt Injection nlp

1 citations PDF

attack arXiv Oct 10, 2025 · Oct 2025

Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols

Mikhail Terekhov, Alexander Panfilov, Daniil Dzenhaliou et al. · MATS · EPFL +4 more

Embeds prompt injections in LLM agent outputs to subvert AI control monitors, collapsing safety-usefulness tradeoffs across protocols

Prompt Injection Excessive Agency nlp

5 citations PDF

benchmark arXiv Sep 22, 2025 · Sep 2025

Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs

Alexander Panfilov, Evgenii Kortukov, Kristina Nikolić et al. · ELLIS Institute Tübingen · Tübingen AI Center +5 more

Frontier LLMs spontaneously produce fake-harmful but actually-harmless responses that fool all tested jailbreak monitors, detectable only via activation probes

Prompt Injection nlp

1 citations PDF

Latest papers

Visual Memory Injection Attacks for Multi-Turn Conversations

Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models

Safety Alignment of LMs via Non-cooperative Games

Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols

Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs

Filters

Time Period

Paper Type

OWASP ML Top 10

OWASP LLM Top 10

Institution

Venue