Latest papers

15 papers
benchmark arXiv Mar 19, 2026 · 18d ago

MIDST Challenge at SaTML 2025: Membership Inference over Diffusion-models-based Synthetic Tabular data

Masoumeh Shafieinejad, Xi He, Mahshid Alinoori et al. · Vector Institute · University of Waterloo +3 more

Competition evaluating membership inference attack resistance of diffusion models generating synthetic tabular data across white-box and black-box settings

Membership Inference Attack tabulargenerative
PDF Code
defense arXiv Mar 5, 2026 · 4w ago

When Priors Backfire: On the Vulnerability of Unlearnable Examples to Pretraining

Zhihao Li, Gezheng Xu, Jiale Cai et al. · Western University · Concordia University +2 more

Proposes BAIT, a bi-level optimization that makes availability-poisoning data protection robust against pretrained model fine-tuning

Data Poisoning Attack vision
PDF Code
benchmark arXiv Feb 23, 2026 · 6w ago

Agents of Chaos

Natalie Shapira, Chris Wendler, Avery Yen et al. · Northeastern University · Independent Researcher +11 more

Red-teams live autonomous LLM agents over two weeks, documenting 11 case studies of dangerous failures including system takeover, DoS, and sensitive data disclosure

Excessive Agency Prompt Injection Insecure Plugin Design nlp
3 citations PDF
defense arXiv Feb 21, 2026 · 6w ago

Limits of Convergence-Rate Control for Open-Weight Safety

Domenic Rosati, Xijie Zeng, Hong Huang et al. · Dalhousie University · Vector Institute +1 more

Defends open-weight models against harmful fine-tuning via spectral reparameterization, proving adaptive adversaries can bypass any such defense at linear model-size cost

Transfer Learning Attack visionnlp
PDF
tool arXiv Feb 13, 2026 · 7w ago

GPTZero: Robust Detection of LLM-Generated Texts

George Alexandru Adam, Alexander Cui, Edwin Thomas et al. · GPTZero · University of Waterloo +3 more

GPTZero detects LLM-generated text with a hierarchical multi-task architecture and adversarial robustness via red teaming

Output Integrity Attack nlp
PDF
benchmark arXiv Feb 6, 2026 · 8w ago

TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering

Saad Hossain, Tom Tseng, Punya Syon Pandey et al. · Critical ML Lab · FAR.AI +6 more

Benchmark framework for evaluating LLM tamper resistance across 9 fine-tuning and weight-space attacks on 21 open-weight models

Transfer Learning Attack Prompt Injection nlp
1 citations PDF Code
defense arXiv Feb 4, 2026 · 8w ago

Cascading Robustness Verification: Toward Efficient Model-Agnostic Certification

Mohammadreza Maleki, Rushendra Sidibomma, Arman Adibi et al. · Toronto Metropolitan University · University of Minnesota Twin-Cities +2 more

Cascading verifier framework certifies neural network robustness against adversarial examples with 90% runtime reduction over single-verifier baselines

Input Manipulation Attack vision
PDF
defense arXiv Jan 28, 2026 · 9w ago

How does information access affect LLM monitors' ability to detect sabotage?

Rauno Arike, Raja Mehta Moreno, Rohan Subramani et al. · Aether Research · Vector Institute +4 more

Proposes extract-and-evaluate monitoring to catch sabotaging LLM agents, finding less monitor information often yields better detection.

Excessive Agency nlp
PDF
defense arXiv Jan 14, 2026 · 11w ago

CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents

Hanna Foerster, Tom Blanchard, Kristina Nikolić et al. · University of Cambridge · University of Toronto +3 more

Defends computer-use AI agents against prompt injection via pre-computed execution graphs, revealing Branch Steering as a residual threat

Prompt Injection Excessive Agency nlpmultimodal
1 citations PDF
benchmark arXiv Nov 28, 2025 · Nov 2025

Are LLMs Good Safety Agents or a Propaganda Engine?

Neemesh Yadav, Francesco Ortu, Jiarui Liu et al. · Southern Methodist University · University of Trieste +6 more

Benchmarks LLM refusal behaviors using prompt injection attacks to distinguish genuine safety guardrails from political censorship

Prompt Injection nlp
PDF
benchmark arXiv Oct 14, 2025 · Oct 2025

An Investigation of Memorization Risk in Healthcare Foundation Models

Sana Tonekaboni, Lena Stempfle, Adibvafa Fallahpour et al. · MIT · Broad Institute +6 more

Black-box evaluation framework measuring extractable patient data memorization in healthcare EHR foundation models at embedding and generative levels

Model Inversion Attack tabular
1 citations PDF Code
defense arXiv Oct 6, 2025 · Oct 2025

Indirect Prompt Injections: Are Firewalls All You Need, or Stronger Benchmarks?

Rishika Bhagwatkar, Kevin Kasa, Abhay Puri et al. · ServiceNow Research · Mila - Québec AI Institute +3 more

Modular agent-tool firewall achieves perfect indirect prompt injection defense on four benchmarks, while exposing those benchmarks as too weak

Prompt Injection nlp
4 citations PDF
benchmark arXiv Oct 6, 2025 · Oct 2025

SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests

Punya Syon Pandey, Hai Son Le, Devansh Bhardwaj et al. · University of Toronto · Vector Institute +4 more

Benchmarks LLM vulnerability to sociopolitical harm requests across 585 prompts, 34 countries, revealing 97–98% attack success rates

Prompt Injection nlp
PDF Code
benchmark arXiv Sep 8, 2025 · Sep 2025

Not All Samples Are Equal: Quantifying Instance-level Difficulty in Targeted Data Poisoning

William Xu, Yiwei Lu, Yihan Wang et al. · University of Waterloo · University of Ottawa +3 more

Introduces three metrics—ergodic prediction accuracy, poison distance, and budget—to predict which test instances are most vulnerable to targeted data poisoning

Data Poisoning Attack vision
PDF
benchmark arXiv Aug 16, 2025 · Aug 2025

Demystifying Foreground-Background Memorization in Diffusion Models

Jimmy Z. Di, Yiwei Lu, Yaoliang Yu et al. · University of Waterloo · Vector Institute +2 more

Proposes FB-Mem segmentation metric to quantify partial training data memorization in diffusion models, showing current mitigations fail for foreground regions

Model Inversion Attack visiongenerative
PDF