ML Security Papers

Latest papers

34 papers

benchmark arXiv Apr 29, 2026 · 22d ago

Indirect Prompt Injection in the Wild: An Empirical Study of Prevalence, Techniques, and Objectives

Soheil Khodayari, Xuenan Zhang, Bhupendra Acharya et al. · Independent Researcher · CISPA Helmholtz Center for Information Security +1 more

Discovers 15.3K real-world indirect prompt injections across 1.2B URLs targeting LLM crawlers, agents, and automation systems

Prompt Injection nlpmultimodal

PDF

benchmark arXiv Apr 17, 2026 · 4w ago

TwoHamsters: Benchmarking Multi-Concept Compositional Unsafety in Text-to-Image Models

Chaoshuo Zhang, Yibo Liang, Mengke Tian et al. · Xi’an Jiaotong University · CISPA Helmholtz Center for Information Security

Benchmark evaluating compositional safety vulnerabilities in text-to-image models when benign concepts combine to create unsafe outputs

Input Manipulation Attack visiongenerative

PDF

defense arXiv Apr 17, 2026 · 4w ago

Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs

Wai Man Si, Mingjie Li, Michael Backes et al. · CISPA Helmholtz Center for Information Security

Prunes model parameters responsible for unsafe LLM outputs, reducing harmful generations and jailbreak success with minimal utility loss

Prompt Injection nlpmultimodal

PDF

benchmark arXiv Apr 9, 2026 · 6w ago

The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training

Rui Zhang, Hongwei Li, Yun Shen et al. · University of Electronic Science and Technology of China · Flexera +2 more

Evaluates six fine-tuning methods for both misaligning safety-aligned LLMs and realigning them, revealing asymmetric attack-defense dynamics

Transfer Learning Attack Prompt Injection Training Data Poisoning nlp

PDF Code

benchmark arXiv Mar 12, 2026 · 10w ago

Understanding LLM Behavior When Encountering User-Supplied Harmful Content in Harmless Tasks

Junjie Chu, Yiting Qu, Ye Leng et al. · CISPA Helmholtz Center for Information Security · Delft University of Technology

Benchmarks LLM safety alignment failures when harmful content is embedded in benign tasks like translation, revealing a content-level ethical blind spot

Prompt Injection nlp

PDF

defense arXiv Mar 3, 2026 · 11w ago

Conditioned Activation Transport for T2I Safety Steering

Maciej Chrabąszcz, Aleksander Szymczyk, Jan Dubiński et al. · NASK National Research Institute · Warsaw University of Technology +3 more

Proposes conditioned activation transport to steer T2I model activations away from unsafe regions while preserving image quality

Prompt Injection visionmultimodalgenerative

PDF Code

attack arXiv Mar 1, 2026 · 11w ago

Hide&Seek: Remove Image Watermarks with Negligible Cost via Pixel-wise Reconstruction

Huajie Chen, Tianqing Zhu, Hailin Yang et al. · City University of Macau · CISPA Helmholtz Center for Information Security +1 more

Pixel-wise reconstruction attack removes AI-image watermarks without querying detectors or knowing the watermarking scheme

Output Integrity Attack visiongenerative

PDF

attack arXiv Mar 1, 2026 · 11w ago

Turning Black Box into White Box: Dataset Distillation Leaks

Huajie Chen, Tianqing Zhu, Yuchen Zhong et al. · City University of Macau · CISPA Helmholtz Center for Information Security +2 more

Reveals that dataset distillation leaks training data via three-stage attack: architecture inference, membership inference, and model inversion

Model Inversion Attack Membership Inference Attack vision

PDF

attack arXiv Feb 28, 2026 · 11w ago

Curation Leaks: Membership Inference Attacks against Data Curation for Machine Learning

Dariush Wahdany, Matthew Jagielski, Adam Dziedzic et al. · CISPA Helmholtz Center for Information Security · Anthropic

Membership inference attacks expose private data leakage in curation pipelines even when models train only on public data

Membership Inference Attack vision

PDF

attack arXiv Feb 9, 2026 · Feb 2026

Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs

Yukun Jiang, Hai Huang, Mingjie Li et al. · CISPA Helmholtz Center for Information Security

Discovers unsafe routing configurations in MoE LLMs that bypass safety alignment, achieving 0.98 ASR on AdvBench via router optimization

Prompt Injection nlp

PDF Code

attack arXiv Jan 29, 2026 · Jan 2026

Hardware-Triggered Backdoors

Jonas Möller, Erik Imgrund, Thorsten Eisenhofer et al. · Berlin Institute for the Foundations of Learning and Data · TU Berlin +1 more

Exploits GPU floating-point numerical variations to inject hardware-specific backdoors that flip model predictions only on targeted accelerators

Model Poisoning vision

PDF

benchmark arXiv Jan 26, 2026 · Jan 2026

Unknown Unknowns: Why Hidden Intentions in LLMs Evade Detection

Devansh Srivastav, David Pape, Lea Schönherr · CISPA Helmholtz Center for Information Security

Taxonomizes hidden covert LLM behaviors induced by adversarial developers and shows detection systematically fails in open-world conditions

Model Poisoning Prompt Injection nlp

PDF

defense arXiv Jan 8, 2026 · Jan 2026

Sequential Subspace Noise Injection Prevents Accuracy Collapse in Certified Unlearning

Polina Dolgova, Sebastian U. Stich · CISPA Helmholtz Center for Information Security · Universität des Saarlandes

Defends against membership inference on forgotten data via block-wise noise injection that preserves certified (ε,δ) unlearning guarantees with far less accuracy loss

Membership Inference Attack vision

PDF

benchmark arXiv Dec 30, 2025 · Dec 2025

Jailbreaking Attacks vs. Content Safety Filters: How Far Are We in the LLM Safety Arms Race?

Yuan Xin, Dingfan Chen, Linyi Yang et al. · CISPA Helmholtz Center for Information Security · Max Planck Institute for Intelligent Systems +1 more

Benchmarks jailbreak attacks against full LLM deployment pipelines with safety filters, finding prior studies overestimated attack success

Prompt Injection nlp

PDF

survey arXiv Dec 10, 2025 · Dec 2025

Chasing Shadows: Pitfalls in LLM Security Research

Jonathan Evertz, Niklas Risse, Nicolai Neuer et al. · CISPA Helmholtz Center for Information Security · Max Planck Institute for Security and Privacy +4 more

Surveys nine methodological pitfalls in LLM security research found in all 72 surveyed papers, with case studies showing how each misleads results

Data Poisoning Attack Prompt Injection nlp

2 citations PDF

tool arXiv Nov 24, 2025 · Nov 2025

AttackPilot: Autonomous Inference Attacks Against ML Services With LLM-Based Agents

Yixin Wu, Rui Wen, Chi Cui et al. · CISPA Helmholtz Center for Information Security · Institute of Science Tokyo

Autonomous LLM agent automates membership inference, model stealing, and data reconstruction attacks on ML services with near-expert accuracy at $0.627/run.

Membership Inference Attack Model Theft Model Inversion Attack Red-Team Agents Triage & Prioritization nlp

PDF

attack arXiv Nov 10, 2025 · Nov 2025

On Stealing Graph Neural Network Models

Marcin Podhajski, Jan Dubiński, Franziska Boenisch et al. · Polish Academy of Sciences · IDEAS NCBR +5 more

Steals GNN models with as few as 100 queries by decoupling query-free backbone extraction from strategic head extraction

Model Theft graph

PDF Code

tool arXiv Oct 31, 2025 · Oct 2025

From Evidence to Verdict: An Agent-Based Forensic Framework for AI-Generated Image Detection

Mengfei Liang, Yiting Qu, Yukun Jiang et al. · CISPA Helmholtz Center for Information Security

Multi-agent forensic framework with LLM debate and memory module achieves 97% accuracy on AI-generated image detection

Output Integrity Attack Blue-Team Agents visionnlp

1 citations PDF

attack arXiv Oct 24, 2025 · Oct 2025

Adjacent Words, Divergent Intents: Jailbreaking Large Language Models via Task Concurrency

Yukun Jiang, Mingjie Li, Michael Backes et al. · CISPA Helmholtz Center for Information Security

Jailbreaks LLMs by interleaving harmful and benign task words, hiding malicious intent from safety guardrails with 95% attack success rate

Prompt Injection nlp

9 citations 1 influentialPDF Code

defense arXiv Oct 24, 2025 · Oct 2025

Soft Instruction De-escalation Defense

Nils Philipp Walter, Chawin Sitawarin, Jamie Hayes et al. · CISPA Helmholtz Center for Information Security · Google DeepMind +1 more

Defends LLM agents against indirect prompt injection via iterative sanitization, limiting adversarial attack success rate to 15%

Prompt Injection nlp

2 citations PDF

Loading more papers…

Latest papers

Indirect Prompt Injection in the Wild: An Empirical Study of Prevalence, Techniques, and Objectives

TwoHamsters: Benchmarking Multi-Concept Compositional Unsafety in Text-to-Image Models

Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs

The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training

Understanding LLM Behavior When Encountering User-Supplied Harmful Content in Harmless Tasks

Conditioned Activation Transport for T2I Safety Steering

Hide&Seek: Remove Image Watermarks with Negligible Cost via Pixel-wise Reconstruction

Turning Black Box into White Box: Dataset Distillation Leaks

Curation Leaks: Membership Inference Attacks against Data Curation for Machine Learning

Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs

Hardware-Triggered Backdoors

Unknown Unknowns: Why Hidden Intentions in LLMs Evade Detection

Sequential Subspace Noise Injection Prevents Accuracy Collapse in Certified Unlearning

Jailbreaking Attacks vs. Content Safety Filters: How Far Are We in the LLM Safety Arms Race?

Chasing Shadows: Pitfalls in LLM Security Research

AttackPilot: Autonomous Inference Attacks Against ML Services With LLM-Based Agents

On Stealing Graph Neural Network Models

From Evidence to Verdict: An Agent-Based Forensic Framework for AI-Generated Image Detection

Adjacent Words, Divergent Intents: Jailbreaking Large Language Models via Task Concurrency

Soft Instruction De-escalation Defense

Filters

Time Period

Paper Type

OWASP ML Top 10

OWASP LLM Top 10

Institution

Venue