Latest papers

10 papers
attack · arXiv · Mar 25, 2026

Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

Alexander Panfilov, Peter Romov, Igor Shilov et al. · MATS · ELLIS Institute Tübingen +3 more

AI agent autonomously discovers novel white-box jailbreak attacks, outperforming 30+ existing methods with a 100% attack success rate (ASR) on target models

Input Manipulation Attack · Prompt Injection · nlp
PDF Code
benchmark · arXiv · Feb 23, 2026

Skill-Inject: Measuring Agent Vulnerability to Skill File Attacks

David Schmotz, Luca Beurer-Kellner, Sahar Abdelnabi et al. · Max Planck Institute for Intelligent Systems · Snyk

Benchmarks LLM agent susceptibility to skill-file prompt injection, finding up to 80% attack success on frontier models

Prompt Injection · Insecure Plugin Design · nlp
PDF Code
benchmark · arXiv · Feb 18, 2026

Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

Nivya Talokar, Ayush K Tarun, Murari Mandal et al. · Independent Researcher · EPFL +4 more

Benchmarks multi-turn, multilingual jailbreaking of LLM agents using a step-by-step illicit planning framework and novel time-to-jailbreak metrics

Prompt Injection · Excessive Agency · nlp
PDF
defense · arXiv · Jan 29, 2026

LoRA and Privacy: When Random Projections Help (and When They Don't)

Yaxi Hu, Johanna Düngler, Bernhard Schölkopf et al. · Max Planck Institute for Intelligent Systems · University of Copenhagen

Proves LoRA lacks inherent privacy via a near-perfect membership inference attack, then derives tighter differential privacy bounds for noisy low-rank fine-tuning

Membership Inference Attack · nlp
PDF
benchmark · arXiv · Dec 30, 2025

Jailbreaking Attacks vs. Content Safety Filters: How Far Are We in the LLM Safety Arms Race?

Yuan Xin, Dingfan Chen, Linyi Yang et al. · CISPA Helmholtz Center for Information Security · Max Planck Institute for Intelligent Systems +1 more

Benchmarks jailbreak attacks against full LLM deployment pipelines with safety filters, finding that prior studies overestimated attack success

Prompt Injection · nlp
PDF
benchmark · arXiv · Nov 28, 2025

Are LLMs Good Safety Agents or a Propaganda Engine?

Neemesh Yadav, Francesco Ortu, Jiarui Liu et al. · Southern Methodist University · University of Trieste +6 more

Benchmarks LLM refusal behaviors using prompt injection attacks to distinguish genuine safety guardrails from political censorship

Prompt Injection · nlp
PDF
benchmark · arXiv · Nov 7, 2025

ConVerse: Benchmarking Contextual Safety in Agent-to-Agent Conversations

Amr Gomaa, Ahmed Salem, Sahar Abdelnabi · German Research Center for Artificial Intelligence · Microsoft +3 more

Benchmarks privacy leakage and prompt-injection-style attacks across 864 multi-turn agent-to-agent LLM conversations in three domains

Prompt Injection · Sensitive Information Disclosure · nlp
5 citations · 2 influential · PDF Code
attack · arXiv · Oct 10, 2025

Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols

Mikhail Terekhov, Alexander Panfilov, Daniil Dzenhaliou et al. · MATS · EPFL +4 more

Embeds prompt injections in LLM agent outputs to subvert AI control monitors, collapsing safety-usefulness tradeoffs across protocols

Prompt Injection · Excessive Agency · nlp
5 citations · PDF
benchmark · arXiv · Oct 6, 2025

SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests

Punya Syon Pandey, Hai Son Le, Devansh Bhardwaj et al. · University of Toronto · Vector Institute +4 more

Benchmarks LLM vulnerability to sociopolitical harm requests across 585 prompts spanning 34 countries, revealing 97–98% attack success rates

Prompt Injection · nlp
PDF Code
defense · arXiv · Sep 1, 2025

Model Unmerging: Making Your Models Unmergeable for Secure Model Sharing

Zihao Wang, Enneng Yang, Lu Yin et al. · Sun Yat-Sen University · University of Surrey +1 more

Protects fine-tuned model IP by disrupting the attention parameter space to prevent unauthorized model merging, without affecting model utility

Model Theft · vision · nlp
PDF Code