ML Security Papers

Latest papers

15 papers

benchmark arXiv Mar 19, 2026 · 18d ago

MIDST Challenge at SaTML 2025: Membership Inference over Diffusion-models-based Synthetic Tabular data

Masoumeh Shafieinejad, Xi He, Mahshid Alinoori et al. · Vector Institute · University of Waterloo +3 more

Competition evaluating membership inference attack resistance of diffusion models generating synthetic tabular data across white-box and black-box settings

Membership Inference Attack tabular generative

PDF Code

defense arXiv Mar 5, 2026 · 4w ago

When Priors Backfire: On the Vulnerability of Unlearnable Examples to Pretraining

Zhihao Li, Gezheng Xu, Jiale Cai et al. · Western University · Concordia University +2 more

Proposes BAIT, a bi-level optimization that makes availability-poisoning data protection robust against pretrained model fine-tuning

Data Poisoning Attack vision

PDF Code

benchmark arXiv Feb 23, 2026 · 6w ago

Agents of Chaos

Natalie Shapira, Chris Wendler, Avery Yen et al. · Northeastern University · Independent Researcher +11 more

Red-teams live autonomous LLM agents over two weeks, documenting 11 case studies of dangerous failures including system takeover, DoS, and sensitive data disclosure

Excessive Agency Prompt Injection Insecure Plugin Design nlp

3 citations PDF

defense arXiv Feb 21, 2026 · 6w ago

Limits of Convergence-Rate Control for Open-Weight Safety

Domenic Rosati, Xijie Zeng, Hong Huang et al. · Dalhousie University · Vector Institute +1 more

Defends open-weight models against harmful fine-tuning via spectral reparameterization, proving adaptive adversaries can bypass any such defense at linear model-size cost

Transfer Learning Attack vision nlp

PDF

tool arXiv Feb 13, 2026 · 7w ago

GPTZero: Robust Detection of LLM-Generated Texts

George Alexandru Adam, Alexander Cui, Edwin Thomas et al. · GPTZero · University of Waterloo +3 more

GPTZero detects LLM-generated text with a hierarchical multi-task architecture and adversarial robustness via red teaming

Output Integrity Attack nlp

PDF

benchmark arXiv Feb 6, 2026 · 8w ago

TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering

Saad Hossain, Tom Tseng, Punya Syon Pandey et al. · Critical ML Lab · FAR.AI +6 more

Benchmark framework for evaluating LLM tamper resistance across 9 fine-tuning and weight-space attacks on 21 open-weight models

Transfer Learning Attack Prompt Injection nlp

1 citation PDF Code

defense arXiv Feb 4, 2026 · 8w ago

Cascading Robustness Verification: Toward Efficient Model-Agnostic Certification

Mohammadreza Maleki, Rushendra Sidibomma, Arman Adibi et al. · Toronto Metropolitan University · University of Minnesota Twin-Cities +2 more

Cascading verifier framework certifies neural network robustness against adversarial examples with 90% runtime reduction over single-verifier baselines

Input Manipulation Attack vision

PDF

Certifying neural network robustness against adversarial examples is challenging, as formal guarantees often require solving non-convex problems. Hence, incomplete verifiers are widely used because they scale efficiently and substantially reduce the cost of robustness verification compared to complete methods. However, relying on a single verifier can underestimate robustness because of loose approximations or misalignment with training methods. In this work, we propose Cascading Robustness Verification (CRV), which goes beyond an engineering improvement by exposing fundamental limitations of existing robustness metrics and introducing a framework that enhances both reliability and efficiency. CRV is a model-agnostic verifier, meaning that its robustness guarantees are independent of the model's training process. The key insight behind the CRV framework is that, when using multiple verification methods, an input is certifiably robust if at least one method certifies it as robust. Rather than relying solely on a single verifier with a fixed constraint set, CRV progressively applies multiple verifiers to balance the tightness of the bound and computational cost. Starting with the least expensive method, CRV halts as soon as an input is certified as robust; otherwise, it proceeds to more expensive methods. For computationally expensive methods, we introduce a Stepwise Relaxation Algorithm (SR) that incrementally adds constraints and checks for certification at each step, thereby avoiding unnecessary computation. Our theoretical analysis demonstrates that CRV achieves equal or higher verified accuracy compared to powerful but computationally expensive incomplete verifiers in the cascade, while significantly reducing verification overhead. Empirical results confirm that CRV certifies at least as many inputs as benchmark approaches, while improving runtime efficiency by up to ~90%.
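The cascade described in the abstract (try the cheapest verifier first, stop at the first certificate, otherwise escalate) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `cascade_verify` and `make_bound_verifier`, the toy "margin minus slack" bound, and the sample values are all assumptions made for illustration, and the Stepwise Relaxation Algorithm is not modeled.

```python
from typing import Callable, Optional, Sequence, Tuple

# A verifier returns True only if it can certify the input as robust.
# False means "could not certify", not "proven non-robust".
Verifier = Callable[[float], bool]


def cascade_verify(
    x: float, verifiers: Sequence[Tuple[str, Verifier]]
) -> Optional[str]:
    """Run verifiers in the given order (cheapest first) and stop at the
    first one that certifies x. Returns that verifier's name, or None."""
    for name, verify in verifiers:
        if verify(x):
            return name  # certified: more expensive verifiers are skipped
    return None


def make_bound_verifier(slack: float) -> Verifier:
    """Hypothetical toy verifier: certifies when a lower bound on the
    classification margin (margin - slack) is positive. A larger slack
    stands in for a looser, cheaper bound."""
    return lambda margin: margin - slack > 0.0


# Ordered from least to most expensive (assumed costs, for illustration).
verifiers = [
    ("cheap_loose", make_bound_verifier(slack=1.0)),
    ("mid", make_bound_verifier(slack=0.5)),
    ("expensive_tight", make_bound_verifier(slack=0.1)),
]

margins = [2.0, 0.7, 0.2, 0.05]
results = [cascade_verify(m, verifiers) for m in margins]
print(results)
```

In this toy setup an input is certified by the cascade exactly when at least one verifier certifies it, so the cascade certifies no fewer inputs than any single verifier in it, while easy inputs never reach the expensive verifiers.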

cnn transformer Toronto Metropolitan University · University of Minnesota Twin-Cities · Augusta University +1 more

PDF arXiv DOI

defense arXiv Jan 28, 2026 · 9w ago

How does information access affect LLM monitors' ability to detect sabotage?

Rauno Arike, Raja Mehta Moreno, Rohan Subramani et al. · Aether Research · Vector Institute +4 more

Proposes extract-and-evaluate monitoring to catch sabotaging LLM agents, finding less monitor information often yields better detection

Excessive Agency nlp

PDF

defense arXiv Jan 14, 2026 · 11w ago

CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents

Hanna Foerster, Tom Blanchard, Kristina Nikolić et al. · University of Cambridge · University of Toronto +3 more

Defends computer-use AI agents against prompt injection via pre-computed execution graphs, revealing Branch Steering as a residual threat

Prompt Injection Excessive Agency nlp multimodal

1 citation PDF

benchmark arXiv Nov 28, 2025 · Nov 2025

Are LLMs Good Safety Agents or a Propaganda Engine?

Neemesh Yadav, Francesco Ortu, Jiarui Liu et al. · Southern Methodist University · University of Trieste +6 more

Benchmarks LLM refusal behaviors using prompt injection attacks to distinguish genuine safety guardrails from political censorship

Prompt Injection nlp

PDF

benchmark arXiv Oct 14, 2025 · Oct 2025

An Investigation of Memorization Risk in Healthcare Foundation Models

Sana Tonekaboni, Lena Stempfle, Adibvafa Fallahpour et al. · MIT · Broad Institute +6 more

Black-box evaluation framework measuring extractable patient data memorization in healthcare EHR foundation models at embedding and generative levels

Model Inversion Attack tabular

1 citation PDF Code

defense arXiv Oct 6, 2025 · Oct 2025

Indirect Prompt Injections: Are Firewalls All You Need, or Stronger Benchmarks?

Rishika Bhagwatkar, Kevin Kasa, Abhay Puri et al. · ServiceNow Research · Mila - Québec AI Institute +3 more

Modular agent-tool firewall achieves perfect indirect prompt injection defense on four benchmarks, while exposing those benchmarks as too weak

Prompt Injection nlp

4 citations PDF

benchmark arXiv Oct 6, 2025 · Oct 2025

SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests

Punya Syon Pandey, Hai Son Le, Devansh Bhardwaj et al. · University of Toronto · Vector Institute +4 more

Benchmarks LLM vulnerability to sociopolitical harm requests across 585 prompts, 34 countries, revealing 97–98% attack success rates

Prompt Injection nlp

PDF Code

benchmark arXiv Sep 8, 2025 · Sep 2025

Not All Samples Are Equal: Quantifying Instance-level Difficulty in Targeted Data Poisoning

William Xu, Yiwei Lu, Yihan Wang et al. · University of Waterloo · University of Ottawa +3 more

Introduces three metrics—ergodic prediction accuracy, poison distance, and budget—to predict which test instances are most vulnerable to targeted data poisoning

Data Poisoning Attack vision

PDF

benchmark arXiv Aug 16, 2025 · Aug 2025

Demystifying Foreground-Background Memorization in Diffusion Models

Jimmy Z. Di, Yiwei Lu, Yaoliang Yu et al. · University of Waterloo · Vector Institute +2 more

Proposes FB-Mem segmentation metric to quantify partial training data memorization in diffusion models, showing current mitigations fail for foreground regions

Model Inversion Attack vision generative

PDF

