Latest papers

26 papers
benchmark arXiv Mar 19, 2026

MIDST Challenge at SaTML 2025: Membership Inference over Diffusion-models-based Synthetic Tabular data

Masoumeh Shafieinejad, Xi He, Mahshid Alinoori et al. · Vector Institute · University of Waterloo +3 more

Competition evaluating membership inference attack resistance of diffusion models generating synthetic tabular data across white-box and black-box settings

Membership Inference Attack tabular generative
PDF Code
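
Several entries in this feed hinge on membership inference. As a primer, here is a minimal sketch of the classic loss-threshold attack (Yeom et al. style), not any particular MIDST submission; `model.loss`, `holdout`, and the 5% quantile are illustrative assumptions:

```python
import numpy as np

def calibrate_threshold(model, holdout, q=0.05):
    """Pick a loss threshold from records known NOT to be in training,
    targeting roughly a q false-positive rate. `model.loss` is an assumed
    per-example loss API, not part of any specific MIDST submission."""
    losses = np.array([model.loss(x, y) for x, y in holdout])
    return np.quantile(losses, q)

def loss_threshold_mia(model, x, y, threshold):
    """Flag (x, y) as a training-set member when the target model's loss
    on it falls below the calibrated threshold: members tend to be fit
    more tightly than unseen records."""
    return model.loss(x, y) < threshold
```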
defense arXiv Feb 22, 2026

ReVision: A Post-Hoc, Vision-Based Technique for Replacing Unacceptable Concepts in Image Generation Pipeline

Gurjot Singh, Prabhjot Singh, Aashima Sharma et al. · University of Waterloo · University of Melbourne +2 more

Post-hoc VLM-assisted framework detects and edits policy-violating content in diffusion model outputs without retraining

Output Integrity Attack vision generative
PDF
defense arXiv Feb 15, 2026

Online LLM watermark detection via e-processes

Weijie Su, Ruodu Wang, Zinan Zhao · University of Pennsylvania · University of Waterloo +1 more

Proposes anytime-valid e-process framework for sequential LLM watermark detection with theoretical power guarantees

Output Integrity Attack nlp
PDF
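
The e-process machinery referenced above admits a compact skeleton. The sketch below shows the generic anytime-valid recipe (multiply e-values, stop when the running product crosses 1/α), not the paper's specific watermark detector; `e_value_for_token` is a hypothetical placeholder:

```python
def eprocess_detect(tokens, e_value_for_token, alpha=0.05):
    """Generic anytime-valid sequential test via an e-process: multiply
    nonnegative e-values (each with expectation <= 1 under the null
    "text is unwatermarked") and stop as soon as the running product
    exceeds 1/alpha. Ville's inequality bounds the false-alarm rate by
    alpha at any data-dependent stopping time. `e_value_for_token` is a
    placeholder for a watermark-specific e-value, not the paper's
    construction."""
    wealth = 1.0
    for t, tok in enumerate(tokens):
        wealth *= e_value_for_token(tok)
        if wealth >= 1.0 / alpha:
            return True, t + 1    # watermark detected after t+1 tokens
    return False, len(tokens)     # no detection so far; can keep monitoring
```

The appeal over fixed-sample tests is that detection can fire as soon as the evidence suffices, and monitoring can continue indefinitely without alpha-spending corrections.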
tool arXiv Feb 13, 2026

GPTZero: Robust Detection of LLM-Generated Texts

George Alexandru Adam, Alexander Cui, Edwin Thomas et al. · GPTZero · University of Waterloo +3 more

GPTZero detects LLM-generated text with a hierarchical multi-task architecture and adversarial robustness via red teaming

Output Integrity Attack nlp
PDF
benchmark arXiv Feb 13, 2026

Backdooring Bias in Large Language Models

Anudeep Das, Prach Chantasantitam, Gurjot Singh et al. · University of Waterloo

Analyzes syntactic and semantic backdoor attacks inducing bias in LLMs under a white-box threat model with 1000+ evaluations

Model Poisoning nlp
PDF
benchmark arXiv Feb 6, 2026

TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering

Saad Hossain, Tom Tseng, Punya Syon Pandey et al. · Critical ML Lab · FAR.AI +6 more

Benchmark framework for evaluating LLM tamper resistance across 9 fine-tuning and weight-space attacks on 21 open-weight models

Transfer Learning Attack Prompt Injection nlp
1 citation PDF Code
benchmark arXiv Feb 6, 2026

Robust Online Learning

Sajad Ashkezari · University of Waterloo

Theoretical framework characterizing robust online learnability via a new Littlestone-like dimension under adversarial input perturbations

Input Manipulation Attack
PDF
defense arXiv Feb 5, 2026

Private and interpretable clinical prediction with quantum-inspired tensor train models

José Ramón Pareja Monturiol, Juliette Sinnott, Roger G. Melko et al. · Universidad Complutense de Madrid · Instituto de Ciencias Matemáticas +2 more

Defends clinical ML models against membership inference using tensor train obfuscation, reducing white-box attacks to random guessing

Membership Inference Attack tabular
PDF
defense arXiv Jan 31, 2026

Unifying Adversarial Robustness and Training Across Text Scoring Models

Manveer Singh Tamber, Hosna Oyarhoseini, Jimmy Lin · University of Waterloo

Unified adversarial training framework for text scoring LMs defending against token-manipulation and content injection attacks including reward hacking

Input Manipulation Attack Prompt Injection nlp
PDF Code
defense arXiv Jan 15, 2026

Understanding and Preserving Safety in Fine-Tuned LLMs

Jiawen Zhang, Yangfan Hu, Kejia Chen et al. · Zhejiang University · University of Wisconsin–Madison +4 more

Preserves LLM jailbreak resistance through fine-tuning by projecting utility gradients away from the low-rank safety subspace

Transfer Learning Attack Prompt Injection nlp
PDF Code
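
The gradient-projection idea in this summary is a standard linear-algebra step: remove from each utility update its component inside the protected subspace. A minimal sketch assuming an orthonormal basis U for the estimated safety subspace (the basis construction below is illustrative, not the paper's estimator):

```python
import numpy as np

def project_away(grad, U):
    """Remove the component of `grad` lying in the subspace spanned by the
    orthonormal columns of U (a stand-in for an estimated low-rank safety
    subspace): g' = g - U (U^T g). Fine-tuning with g' moves utility
    parameters without disturbing the protected directions."""
    return grad - U @ (U.T @ grad)

# Illustration with a random orthonormal basis (not the paper's estimator):
d, r = 1024, 8
U, _ = np.linalg.qr(np.random.randn(d, r))   # orthonormal columns
g = np.random.randn(d)
g_safe = project_away(g, U)
assert np.allclose(U.T @ g_safe, 0, atol=1e-8)  # no residual safety component
```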
defense arXiv Jan 5, 2026

Safety at One Shot: Patching Fine-Tuned LLMs with A Single Instance

Jiawen Zhang, Lipeng He, Kejia Chen et al. · Zhejiang University · University of Waterloo +2 more

Recovers LLM safety alignment after harmful fine-tuning using a single safety example via low-rank gradient structure

Transfer Learning Attack Prompt Injection nlp
1 citation PDF
defense arXiv Dec 19, 2025

AdaptPrompt: Parameter-Efficient Adaptation of VLMs for Generalizable Deepfake Detection

Yichen Jiang, Mohammed Talha Alam, Sohail Ahmed Khan et al. · University of Waterloo · MBZUAI +1 more

Adapts CLIP with prompt tuning and visual adapters to detect GAN and diffusion deepfakes across 25 diverse test sets

Output Integrity Attack vision
PDF
benchmark arXiv Nov 24, 2025

SPQR: A Standardized Benchmark for Modern Safety Alignment Methods in Text-to-Image Diffusion Models

Mohammed Talha Alam, Nada Saadi, Fahad Shamshad et al. · Mohamed bin Zayed University of Artificial Intelligence · Michigan State University +1 more

Benchmarks T2I diffusion safety alignment across safety, utility, quality, and robustness after benign LoRA fine-tuning

Output Integrity Attack Transfer Learning Attack vision generative
PDF
benchmark arXiv Nov 14, 2025

On the Trade-Off Between Transparency and Security in Adversarial Machine Learning

Lucas Fenaux, Christopher Srinivasa, Florian Kerschbaum · University of Waterloo · Borealis AI

Game-theoretic analysis reveals defense obscurity benefits defenders; existing benchmarks underestimate transferable adversarial attack potency by up to 3.73×

Input Manipulation Attack vision
PDF
defense arXiv Nov 7, 2025

MedFedPure: A Medical Federated Framework with MAE-based Detection and Diffusion Purification for Inference-Time Attacks

Mohammad Karami, Mohammad Reza Nemati, Aidin Kazemi et al. · University of Tehran · Max Planck Institute for Brain Research +2 more

Federated defense combining MAE detection and diffusion purification to protect brain MRI classifiers from adversarial attacks at inference time

Input Manipulation Attack vision federated-learning
PDF
defense arXiv Oct 14, 2025

Locket: Robust Feature-Locking Technique for Language Models

Lipeng He, Vasisht Duddu, N. Asokan · University of Waterloo

Adapter-merging technique locks premium LLM features behind credentials, resisting prompt-based evasion and fine-tuning bypass attacks

Transfer Learning Attack Prompt Injection nlp
PDF
defense arXiv Oct 8, 2025

PATCH: Mitigating PII Leakage in Language Models with Privacy-Aware Targeted Circuit PatcHing

Anthony Hughes, Vasisht Duddu, N. Asokan et al. · University of Sheffield · University of Waterloo

Defends LLMs against PII extraction attacks by identifying and surgically patching memorization circuits, reducing recall by 65%

Model Inversion Attack Sensitive Information Disclosure nlp
PDF
attack arXiv Sep 28, 2025

GPM: The Gaussian Pancake Mechanism for Planting Undetectable Backdoors in Differential Privacy

Haochen Sun, Xi He · University of Waterloo

Backdoor DP mechanism indistinguishable from Gaussian Mechanism silently degrades privacy, enabling near-perfect membership inference attacks

AI Supply Chain Attacks Membership Inference Attack
PDF
tool arXiv Sep 15, 2025

Amulet: a Python Library for Assessing Interactions Among ML Defenses and Risks

Asim Waheed, Vasisht Duddu, Rui Zhang et al. · University of Waterloo · Zhejiang University +1 more

Open-source Python library revealing unintended cross-risk tradeoffs when combining ML defenses against adversarial, privacy, and fairness threats

Input Manipulation Attack Membership Inference Attack vision tabular
PDF
benchmark arXiv Sep 10, 2025

Evaluation Awareness Scales Predictably in Open-Weights Large Language Models

Maheep Chaudhary, Ian Su, Nikhil Hooda et al. · Independent · University of California +6 more

Discovers power-law scaling of LLM evaluation awareness across 15 models, forecasting deceptive capability concealment in larger models

Prompt Injection nlp
PDF Code