ML Security Papers

Latest papers

17 papers

benchmark arXiv Apr 21, 2026 · 4w ago

ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety

Kun Wang, Cheng Qian, Miao Yu et al. · Nanyang Technological University · University of Science and Technology of China +3 more

Interpretability framework revealing that MLLM backdoors encode in low-rank projector subspaces with norm-scaled activation mechanisms

Model Poisoning multimodal nlp vision

PDF Code

survey arXiv Mar 31, 2026 · 7w ago

Security in LLM-as-a-Judge: A Comprehensive SoK

Aiman Almasoud, Antony Anju, Marco Arazzi et al. · arXiv · University of Pavia +1 more

First comprehensive survey organizing 45 studies on security risks of LLM-as-a-Judge systems including adversarial manipulation and evaluation vulnerabilities

Prompt Injection nlp

PDF

attack arXiv Mar 10, 2026 · 10w ago

CLIOPATRA: Extracting Private Information from LLM Insights

Meenatchi Sundaram Muthu Selva Annamalai, Emiliano De Cristofaro, Peter Kairouz · arXiv · University College London +1 more

Attacks Anthropic's Clio LLM analytics platform by injecting crafted chats to extract private medical history of target users, bypassing layered privacy protections

Sensitive Information Disclosure Prompt Injection nlp

PDF Code

defense arXiv Mar 2, 2026 · 11w ago

DualSentinel: A Lightweight Framework for Detecting Targeted Attacks in Black-box LLM via Dual Entropy Lull Pattern

Xiaoyi Pang, Xuanyi Hao, Pengyu Liu et al. · arXiv · The Hong Kong University of Science and Technology +1 more

Detects backdoor and prompt injection attacks in black-box LLMs by monitoring token entropy lulls during generation

Model Poisoning Prompt Injection nlp

PDF Code

defense arXiv Feb 21, 2026 · 12w ago

Echoes of Ownership: Adversarial-Guided Dual Injection for Copyright Protection in MLLMs

Chengwei Xia, Fan Ma, Ruijie Quan et al. · Lanzhou University · arXiv +2 more

Adversarially-optimized trigger images that verify MLLM copyright by eliciting ownership text only in fine-tuned derivatives

Model Theft Model Theft multimodal nlp

PDF

defense arXiv Feb 12, 2026 · Feb 2026

ANML: Attribution-Native Machine Learning with Guaranteed Robustness

Oliver Zahn, Matt Beton, Simran Chana · arXiv · Independent Researcher +1 more

Defends against data poisoning via contributor-reputation-weighted training, outperforming Byzantine-robust baselines under joint credential-faking and gradient-alignment attacks

Data Poisoning Attack tabular federated-learning

PDF

attack arXiv Dec 9, 2025 · Dec 2025

Universal Adversarial Suffixes for Language Models Using Reinforcement Learning with Calibrated Reward

Sampriti Soor, Suklav Ghosh, Arijit Sur · arXiv · Indian Institute of Technology Guwahati

RL-trained adversarial suffixes degrade LLM classification accuracy using PPO and calibrated cross-entropy, outperforming gradient-based triggers in transferability

Input Manipulation Attack nlp

PDF

benchmark arXiv Dec 1, 2025 · Dec 2025

Securing Large Language Models (LLMs) from Prompt Injection Attacks

Omar Farooq Khan Suri, John McCrae · arXiv · University of Galway

Evaluates JATMO fine-tuning defense against HOUYI genetic prompt injection attacks on LLaMA 2 and Qwen, finding incomplete protection

Prompt Injection nlp

PDF

defense arXiv Nov 20, 2025 · Nov 2025

PSM: Prompt Sensitivity Minimization via LLM-Guided Black-Box Optimization

Huseein Jawad, Nicolas Brunel · arXiv · Capgemini Invent +2 more

Defends LLM system prompts against extraction attacks by appending optimized textual shields via black-box LLM-guided optimization

Sensitive Information Disclosure Prompt Injection nlp

PDF Code

attack arXiv Nov 13, 2025 · Nov 2025

Enhanced Privacy Leakage from Noise-Perturbed Gradients via Gradient-Guided Conditional Diffusion Models

Jiayang Meng, Tao Huang, Hong Chen et al. · arXiv · Renmin University of China +1 more

Diffusion model-guided gradient inversion attack that reconstructs private images from noise-perturbed FL gradients, bypassing a common defense

Model Inversion Attack vision federated-learning

PDF

tool arXiv Oct 21, 2025 · Oct 2025

Robustness Verification of Graph Neural Networks Via Lightweight Satisfiability Testing

Chia-Hsuan Lu, Tony Tan, Michael Benedikt · arXiv · University of Oxford +1 more

Verifies GNN robustness against structural adversarial perturbations using polynomial-time partial SAT solvers instead of MIP

Input Manipulation Attack graph

1 citation PDF Code

defense arXiv Oct 20, 2025 · Oct 2025

Fair and Interpretable Deepfake Detection in Videos

Akihito Yoshii, Ryosuke Sonoda, Ramya Srinivasan · arXiv · Fujitsu Limited

Fairness-aware deepfake video detector combining temporal clustering, concept explainability, and frequency-domain augmentation to reduce demographic bias

Output Integrity Attack vision

PDF

benchmark arXiv Oct 1, 2025 · Oct 2025

Eliciting Secret Knowledge from Language Models

Bartosz Cywiński, Emil Ryd, Rowan Wang et al. · arXiv · Senthooran Rajamanoharan IDEAS Research Institute +3 more

Benchmarks black-box and white-box techniques for auditing LLMs that secretly apply but deny hidden knowledge

Sensitive Information Disclosure Prompt Injection nlp

8 citations 2 influential PDF Code

defense arXiv Sep 29, 2025 · Sep 2025

Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention

Yichi Zhang, Yue Ding, Jingwen Yang et al. · arXiv · Shanghai Qi Zhi Institute +3 more

Defends Large Reasoning Models against jailbreaks by aligning CoT safety via process-supervised preference optimization with corrective interventions

Prompt Injection nlp

2 citations 1 influential PDF

defense arXiv Sep 17, 2025 · Sep 2025

Defending Diffusion Models Against Membership Inference Attacks via Higher-Order Langevin Dynamics

Benjamin Sterling, Yousef El-Laham, Mónica F. Bugallo · arXiv · Stony Brook University

Defends diffusion models against membership inference attacks using critically-damped higher-order Langevin dynamics to corrupt sensitive training data earlier in the diffusion process

Membership Inference Attack generative audio

PDF

benchmark arXiv Aug 28, 2025 · Aug 2025

FakeParts: a New Family of AI-Generated DeepFakes

Ziyi Liu, Firas Gabetni, Awais Hussain Sani et al. · arXiv · Institut Polytechnique de Paris

Benchmark dataset of 81K partial deepfake videos exposing critical blind spots in state-of-the-art deepfake detectors

Output Integrity Attack vision generative

PDF

defense arXiv Aug 8, 2025 · Aug 2025

Quantifying Conversation Drift in MCP via Latent Polytope

Haoran Shi, Hongwei Yao, Shuo Shao et al. · arXiv · Zhejiang University +3 more

Defends LLM-MCP tool integrations against indirect prompt injection by detecting adversarial conversation drift in latent polytope space

Insecure Plugin Design Prompt Injection nlp

PDF
