ML Security Papers

Latest papers

28 papers

defense arXiv Apr 22, 2026

Projected Gradient Unlearning for Text-to-Image Diffusion Models: Defending Against Concept Revival Attacks

Aljalila Aladawi, Mohammed Talha Alam, Fakhri Karray · Mohamed bin Zayed University of Artificial Intelligence · University of Waterloo

Defends unlearned diffusion models against concept revival during fine-tuning by projecting gradients onto retain concept subspaces

Model Inversion Attack vision generative

PDF

defense arXiv Apr 8, 2026

Towards Robust Content Watermarking Against Removal and Forgery Attacks

Yifan Zhu, Yihan Wang, Xiao-Shan Gao · Chinese Academy of Sciences · University of Chinese Academy of Sciences +1 more

Instance-specific watermarking defense for diffusion models resisting removal and forgery attacks via dynamic injection and two-sided detection

Output Integrity Attack vision generative

PDF

benchmark arXiv Mar 19, 2026

MIDST Challenge at SaTML 2025: Membership Inference over Diffusion-models-based Synthetic Tabular data

Masoumeh Shafieinejad, Xi He, Mahshid Alinoori et al. · Vector Institute · University of Waterloo +3 more

Competition evaluating membership inference attack resistance of diffusion models generating synthetic tabular data across white-box and black-box settings

Membership Inference Attack tabular generative

PDF Code

defense arXiv Feb 22, 2026

ReVision: A Post-Hoc, Vision-Based Technique for Replacing Unacceptable Concepts in Image Generation Pipeline

Gurjot Singh, Prabhjot Singh, Aashima Sharma et al. · University of Waterloo · University of Melbourne +2 more

Post-hoc VLM-assisted framework detects and edits policy-violating content in diffusion model outputs without retraining

Output Integrity Attack vision generative

PDF

defense arXiv Feb 15, 2026 · Feb 2026

Online LLM watermark detection via e-processes

Weijie Su, Ruodu Wang, Zinan Zhao · University of Pennsylvania · University of Waterloo +1 more

Proposes anytime-valid e-process framework for sequential LLM watermark detection with theoretical power guarantees

Output Integrity Attack nlp

PDF

tool arXiv Feb 13, 2026 · Feb 2026

GPTZero: Robust Detection of LLM-Generated Texts

George Alexandru Adam, Alexander Cui, Edwin Thomas et al. · GPTZero · University of Waterloo +3 more

GPTZero detects LLM-generated text with a hierarchical multi-task architecture and adversarial robustness via red teaming

Output Integrity Attack nlp

PDF

benchmark arXiv Feb 13, 2026 · Feb 2026

Backdooring Bias in Large Language Models

Anudeep Das, Prach Chantasantitam, Gurjot Singh et al. · University of Waterloo

Analyzes syntactic and semantic backdoor attacks inducing bias in LLMs under a white-box threat model with 1000+ evaluations

Model Poisoning nlp

PDF

benchmark arXiv Feb 6, 2026 · Feb 2026

TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering

Saad Hossain, Tom Tseng, Punya Syon Pandey et al. · Critical ML Lab · FAR.AI +6 more

Benchmark framework for evaluating LLM tamper resistance across 9 fine-tuning and weight-space attacks on 21 open-weight models

Transfer Learning Attack Prompt Injection nlp

1 citation PDF Code

benchmark arXiv Feb 6, 2026 · Feb 2026

Robust Online Learning

Sajad Ashkezari · University of Waterloo

Theoretical framework characterizing robust online learnability via a new Littlestone-like dimension under adversarial input perturbations

Input Manipulation Attack

PDF

defense arXiv Feb 5, 2026 · Feb 2026

Private and interpretable clinical prediction with quantum-inspired tensor train models

José Ramón Pareja Monturiol, Juliette Sinnott, Roger G. Melko et al. · Universidad Complutense de Madrid · Instituto de Ciencias Matemáticas +2 more

Defends clinical ML models against membership inference using tensor train obfuscation, reducing white-box attacks to random guessing

Membership Inference Attack tabular

PDF

defense arXiv Jan 31, 2026 · Jan 2026

Unifying Adversarial Robustness and Training Across Text Scoring Models

Manveer Singh Tamber, Hosna Oyarhoseini, Jimmy Lin · University of Waterloo

Unified adversarial training framework for text scoring LMs defending against token-manipulation and content injection attacks including reward hacking

Input Manipulation Attack Prompt Injection nlp

PDF Code

defense arXiv Jan 15, 2026 · Jan 2026

Understanding and Preserving Safety in Fine-Tuned LLMs

Jiawen Zhang, Yangfan Hu, Kejia Chen et al. · Zhejiang University · University of Wisconsin–Madison +4 more

Preserves LLM jailbreak resistance through fine-tuning by projecting utility gradients away from the low-rank safety subspace

Transfer Learning Attack Prompt Injection nlp

PDF Code

defense arXiv Jan 5, 2026 · Jan 2026

Safety at One Shot: Patching Fine-Tuned LLMs with A Single Instance

Jiawen Zhang, Lipeng He, Kejia Chen et al. · Zhejiang University · University of Waterloo +2 more

Recovers LLM safety alignment after harmful fine-tuning using a single safety example via low-rank gradient structure

Transfer Learning Attack Prompt Injection nlp

1 citation PDF

defense arXiv Dec 19, 2025 · Dec 2025

AdaptPrompt: Parameter-Efficient Adaptation of VLMs for Generalizable Deepfake Detection

Yichen Jiang, Mohammed Talha Alam, Sohail Ahmed Khan et al. · University of Waterloo · MBZUAI +1 more

Adapts CLIP with prompt tuning and visual adapters to detect GAN and diffusion deepfakes across 25 diverse test sets

Output Integrity Attack vision

PDF

benchmark arXiv Nov 24, 2025 · Nov 2025

SPQR: A Standardized Benchmark for Modern Safety Alignment Methods in Text-to-Image Diffusion Models

Mohammed Talha Alam, Nada Saadi, Fahad Shamshad et al. · Mohamed bin Zayed University of Artificial Intelligence · Michigan State University +1 more

Benchmarks T2I diffusion safety alignment across safety, utility, quality, and robustness after benign LoRA fine-tuning

Output Integrity Attack Transfer Learning Attack vision generative

PDF

benchmark arXiv Nov 14, 2025 · Nov 2025

On the Trade-Off Between Transparency and Security in Adversarial Machine Learning

Lucas Fenaux, Christopher Srinivasa, Florian Kerschbaum · University of Waterloo · Borealis AI

Game-theoretic analysis reveals defense obscurity benefits defenders; existing benchmarks underestimate transferable adversarial attack potency by up to 3.73×

Input Manipulation Attack vision

PDF

defense arXiv Nov 7, 2025 · Nov 2025

MedFedPure: A Medical Federated Framework with MAE-based Detection and Diffusion Purification for Inference-Time Attacks

Mohammad Karami, Mohammad Reza Nemati, Aidin Kazemi et al. · University of Tehran · Max Planck Institute for Brain Research +2 more

Federated defense combining MAE detection and diffusion purification to protect brain MRI classifiers from adversarial attacks at inference time

Input Manipulation Attack vision federated-learning

PDF

defense arXiv Oct 14, 2025 · Oct 2025

Locket: Robust Feature-Locking Technique for Language Models

Lipeng He, Vasisht Duddu, N. Asokan · University of Waterloo

Adapter-merging technique locks premium LLM features behind credentials, resisting prompt-based evasion and fine-tuning bypass attacks

Transfer Learning Attack Prompt Injection nlp

PDF

defense arXiv Oct 8, 2025 · Oct 2025

PATCH: Mitigating PII Leakage in Language Models with Privacy-Aware Targeted Circuit PatcHing

Anthony Hughes, Vasisht Duddu, N. Asokan et al. · University of Sheffield · University of Waterloo

Defends LLMs against PII extraction attacks by identifying and surgically patching memorization circuits, reducing recall by 65%

Model Inversion Attack Sensitive Information Disclosure nlp

PDF

attack arXiv Sep 28, 2025 · Sep 2025

GPM: The Gaussian Pancake Mechanism for Planting Undetectable Backdoors in Differential Privacy

Haochen Sun, Xi He · University of Waterloo

Backdoor DP mechanism indistinguishable from Gaussian Mechanism silently degrades privacy, enabling near-perfect membership inference attacks

AI Supply Chain Attacks Membership Inference Attack

PDF
