Latest papers

9 papers
benchmark arXiv Feb 23, 2026 · 6w ago

Agents of Chaos

Natalie Shapira, Chris Wendler, Avery Yen et al. · Northeastern University · Independent Researcher +11 more

Red-teams live autonomous LLM agents over two weeks, documenting 11 case studies of dangerous failures including system takeover, DoS, and sensitive data disclosure

Excessive Agency Prompt Injection Insecure Plugin Design nlp
3 citations PDF
attack arXiv Feb 22, 2026 · 6w ago

Understanding Empirical Unlearning with Combinatorial Interpretability

Shingo Kodama, Niv Cohen, Micah Adler et al. · Middlebury College · New York University +2 more

Attacks machine unlearning methods using combinatorial interpretability, showing erased knowledge persists in weights and recovers rapidly via fine-tuning

Model Inversion Attack nlpvision
PDF
defense arXiv Feb 19, 2026 · 6w ago

Privacy-Preserving Mechanisms Enable Cheap Verifiable Inference of LLMs

Arka Pal, Louai Zahran, William Gvozdjak et al. · Ritual · MIT +1 more

Leverages SMPC/FHE privacy mechanisms to build cheap verifiable LLM inference protocols, replacing costly zero-knowledge proofs

Output Integrity Attack nlp
PDF Code
attack arXiv Feb 10, 2026 · 7w ago

Infusion: Shaping Model Behavior by Editing Training Data via Influence Functions

J Rosser, Robert Kirk, Edward Grefenstette et al. · University of Oxford · Independent +2 more

Poisons ML models by perturbing existing training data via influence functions, inducing targeted behavior without injecting explicit attack examples

Data Poisoning Attack Training Data Poisoning visionnlp
PDF Code
defense arXiv Jan 20, 2026 · 10w ago

PAC-Private Responses with Adversarial Composition

Xiaochen Zhu, Mayuri Sridhar, Srinivas Devadas · MIT

Defends ML APIs against membership inference by applying PAC privacy to model outputs with proven adversarial composition guarantees

Membership Inference Attack visionnlptabular
PDF
defense arXiv Dec 6, 2025 · Dec 2025

Delete and Retain: Efficient Unlearning for Document Classification

Aadya Goel, Mayuri Sridhar · Mayuri Sridhar · Acton-Boxborough Regional High School +1 more

Defends document classifiers against membership inference via Hessian-based class-level unlearning with deterministic top-1 reassignment

Membership Inference Attack nlp
PDF
benchmark arXiv Nov 26, 2025 · Nov 2025

The Double-Edged Nature of the Rashomon Set for Trustworthy Machine Learning

Ethan Hsu, Harry Chen, Chudi Zhong et al. · Duke University · MIT +2 more

Analyzes how Rashomon set diversity improves adversarial robustness but increases training data leakage via a proven robustness-privacy trade-off

Input Manipulation Attack Model Inversion Attack tabular
PDF
benchmark arXiv Oct 14, 2025 · Oct 2025

An Investigation of Memorization Risk in Healthcare Foundation Models

Sana Tonekaboni, Lena Stempfle, Adibvafa Fallahpour et al. · MIT · Broad Institute +6 more

Black-box evaluation framework measuring extractable patient data memorization in healthcare EHR foundation models at embedding and generative levels

Model Inversion Attack tabular
1 citations PDF Code
benchmark arXiv Sep 25, 2025 · Sep 2025

Learning the Wrong Lessons: Syntactic-Domain Spurious Correlations in Language Models

Chantal Shaib, Vinith M. Suriyakumar, Levent Sagun et al. · Northeastern University · MIT +1 more

Exploits learned syntactic-domain correlations to bypass LLM safety refusals via malformed or domain-mismatched prompts

Prompt Injection nlp
2 citations PDF