Latest papers

24 papers
benchmark arXiv Mar 11, 2026 · 28d ago

The Unlearning Mirage: A Dynamic Framework for Evaluating LLM Unlearning

Raj Sanjay Shah, Jing Huang, Keerthiram Murugesan et al. · Georgia Institute of Technology · Stanford University +1 more

Exposes the brittleness of LLM unlearning by showing that multi-hop and alias queries recover supposedly forgotten information that static benchmarks miss

Sensitive Information Disclosure nlp
PDF Code
attack arXiv Mar 6, 2026 · 4w ago

Latent Transfer Attack: Adversarial Examples via Generative Latent Spaces

Eitan Shaar, Ariel Shaulov, Yalcin Tur et al. · Tel-Aviv University +4 more

Transfer adversarial attack that optimizes in the Stable Diffusion VAE latent space to produce low-frequency, cross-architecture-transferable perturbations

Input Manipulation Attack vision
PDF
defense arXiv Mar 3, 2026 · 5w ago

Contextualized Privacy Defense for LLM Agents

Yule Wen, Yanzhe Zhang, Jianxun Lian et al. · Tsinghua University · Georgia Tech +2 more

RL-trained instructor model provides context-aware privacy guidance to LLM agents, preventing sensitive data disclosure with 94.2% preservation rate

Sensitive Information Disclosure Prompt Injection nlp
PDF
benchmark arXiv Mar 3, 2026 · 5w ago

Solving adversarial examples requires solving exponential misalignment

Alessandro Salvatore, Stanislav Fort, Surya Ganguli · Stanford University

Geometric analysis shows machine perceptual manifolds are exponentially higher-dimensional than human concepts, explaining why adversarial examples exist

Input Manipulation Attack vision
PDF
benchmark arXiv Mar 1, 2026 · 5w ago

I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift

Subramanyam Sahoo, Vinija Jain, Divya Chaudhary et al. · Independent · Meta AI +3 more

Exposes catastrophic silent failure of LLM toxicity safety classifiers under tiny embedding drift, defeating standard confidence-based monitoring

Prompt Injection nlp
PDF
benchmark arXiv Feb 23, 2026 · 6w ago

Agents of Chaos

Natalie Shapira, Chris Wendler, Avery Yen et al. · Northeastern University · Independent Researcher +11 more

Red-teams live autonomous LLM agents over two weeks, documenting 11 case studies of dangerous failures including system takeover, DoS, and sensitive data disclosure

Excessive Agency Prompt Injection Insecure Plugin Design nlp
3 citations PDF
tool arXiv Feb 10, 2026 · 8w ago

ArtifactLens: Hundreds of Labels Are Enough for Artifact Detection with VLMs

James Burgess, Rameen Abdal, Dan Stoddart et al. · Stanford University · Snap Inc.

Detects AI-generated image artifacts with VLMs using hundreds of labeled examples via in-context learning and prompt optimization

Output Integrity Attack vision multimodal generative
PDF Code
defense arXiv Jan 24, 2026 · 10w ago

Revealing the Truth with ConLLM for Detecting Multi-Modal Deepfakes

Gautam Siddharth Kashyap, Harsh Joshi, Niharika Jain et al. · Macquarie University · Bharati Vidyapeeth’s College Of Engineering +4 more

Proposes ConLLM, a contrastive learning + LLM framework for detecting audio, video, and audio-visual deepfakes

Output Integrity Attack multimodal audio vision nlp
PDF Code
attack arXiv Jan 6, 2026 · Jan 2026

Extracting books from production language models

Ahmed Ahmed, A. Feder Cooper, Sanmi Koyejo et al. · Stanford University · Yale University

Extracts copyrighted books near-verbatim from Claude, GPT-4.1, Gemini, and Grok using Best-of-N jailbreaks and iterative continuation prompts

Model Inversion Attack Sensitive Information Disclosure Prompt Injection nlp
5 citations PDF
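Generically, the Best-of-N jailbreak named in the entry above is just a sampling loop: apply a cheap random perturbation to the prompt, query the model, and keep the first completion that reproduces the target text. The sketch below illustrates that generic loop only, not this paper's exact extraction pipeline; `query_model` and `looks_like_verbatim_text` are hypothetical callables the caller would supply.

```python
import random

def augment(prompt: str) -> str:
    """Apply a cheap random perturbation (here: random case flips), the kind
    of augmentation Best-of-N sampling repeats until one variant slips through."""
    return "".join(c.upper() if random.random() < 0.3 else c.lower() for c in prompt)

def best_of_n_extract(prompt: str, query_model, looks_like_verbatim_text, n: int = 100):
    """Query the model with up to n perturbed prompts and return the first
    continuation that appears to reproduce the target text, else None.
    query_model and looks_like_verbatim_text are hypothetical stand-ins."""
    for _ in range(n):
        completion = query_model(augment(prompt))
        if looks_like_verbatim_text(completion):
            return completion
    return None
```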
attack arXiv Dec 12, 2025 · Dec 2025

Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors

Max McGuinness, Alex Serrano, Luke Bailey et al. · MATS · UC Berkeley +1 more

Fine-tuning embeds a trigger-activated backdoor that lets LLMs evade unseen activation safety monitors zero-shot

Model Poisoning Prompt Injection nlp
2 citations PDF Code
defense arXiv Dec 4, 2025 · Dec 2025

A Sanity Check for Multi-In-Domain Face Forgery Detection in the Real World

Jikang Cheng, Renye Yan, Zhiyuan Yan et al. · Peking University · Nanjing University +3 more

Proposes DevDet, a framework that amplifies real/fake differences over domain signals for robust multi-domain deepfake detection

Output Integrity Attack vision
PDF
benchmark arXiv Nov 24, 2025 · Nov 2025

Open-weight genome language model safeguards: Assessing robustness via adversarial fine-tuning

James R. M. Black, Moritz S. Hanke, Aaron Maiwald et al. · Johns Hopkins Bloomberg School of Public Health · University of Oxford +1 more

Adversarial fine-tuning on viral sequences bypasses data-exclusion safety filtering in an open-weight genomic language model, restoring restricted capabilities

Transfer Learning Attack nlp generative
3 citations PDF
defense arXiv Nov 23, 2025 · Nov 2025

When Generative Replay Meets Evolving Deepfakes: Domain-Aware Relative Weighting for Incremental Face Forgery Detection

Hao Shen, Jikang Cheng, Renye Yan et al. · Huazhong Agricultural University · Peking University +2 more

Proposes DARW to improve incremental deepfake detection via domain-aware generative replay that separates safe from risky synthesized samples

Output Integrity Attack vision generative
PDF
attack arXiv Oct 30, 2025 · Oct 2025

Chain-of-Thought Hijacking

Jianli Zhao, Tingchen Fu, Rylan Schaeffer et al. · Independent Researcher · Stanford University +3 more

Jailbreaks large reasoning models (LRMs) by prepending benign puzzle reasoning that dilutes their safety-refusal signals

Prompt Injection nlp
3 citations PDF
defense arXiv Oct 22, 2025 · Oct 2025

Blackbox Model Provenance via Palimpsestic Membership Inference

Rohith Kuditipudi, Jing Huang, Sally Zhu et al. · Stanford University

Proves LLM model provenance by correlating derivative model outputs with base training-data order via palimpsestic memorization

Model Theft Output Integrity Attack nlp
2 citations PDF
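In spirit, the provenance test in the entry above reduces to a rank-correlation check: if a suspect model descends from the base model, its per-example losses should still reflect when each example was seen during base training (later examples tend to be memorized more strongly). The sketch below shows that generic statistic under stated assumptions (per-example losses and the base training order are already in hand); it is an illustration, not the paper's actual estimator.

```python
import numpy as np
from scipy.stats import spearmanr

def provenance_score(losses: np.ndarray, training_positions: np.ndarray):
    """Rank-correlate a suspect model's per-example losses with each example's
    position in the base model's training order. A significant negative
    correlation (later examples -> lower loss) is evidence that the suspect
    model was derived from that base training run."""
    rho, p_value = spearmanr(losses, training_positions)
    return rho, p_value

# Hypothetical usage: losses from the suspect model on base-training examples,
# paired with the step at which each example appeared during base training.
# rho, p = provenance_score(suspect_losses, base_training_steps)
```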
defense arXiv Oct 13, 2025 · Oct 2025

Don't Walk the Line: Boundary Guidance for Filtered Generation

Sarah Ball, Andreas Haupt · Ludwig-Maximilians-Universität München · Munich Center for Machine Learning +1 more

RL fine-tuning steers LLM outputs away from safety classifier margins to reduce jailbreak bypass and over-refusal simultaneously

Prompt Injection nlp
1 citation PDF Code
benchmark arXiv Oct 1, 2025 · Oct 2025

Understanding Adversarial Transfer: Why Representation-Space Attacks Fail Where Data-Space Attacks Succeed

Isha Gupta, Rylan Schaeffer, Joshua Kazdan et al. · ETH Zürich · Stanford University

Proves adversarial transfer depends on attack domain: data-space attacks cross model boundaries, representation-space attacks don't

Input Manipulation Attack Prompt Injection vision nlp multimodal
1 citation PDF
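The data-space half of that claim is easiest to picture with the classic transfer experiment: craft a pixel-space adversarial example on one (surrogate) classifier and test it on another. The PyTorch sketch below shows that generic setup only, not this paper's experimental protocol; `surrogate` and `target` stand in for any two image classifiers over inputs in [0, 1].

```python
import torch
import torch.nn.functional as F

def fgsm_transfer(surrogate, target, x, y, eps=8 / 255):
    """Craft a pixel-space (data-space) adversarial example on a surrogate
    classifier with FGSM, then check whether it also flips the prediction of
    an unrelated target classifier, i.e. whether the attack transfers."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(surrogate(x_adv), y)
    loss.backward()
    x_adv = (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()
    with torch.no_grad():
        transferred = target(x_adv).argmax(dim=1) != y
    return x_adv, transferred
```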
benchmark arXiv Sep 26, 2025 · Sep 2025

Learning Human-Perceived Fakeness in AI-Generated Videos via Multimodal LLMs

Xingyu Fu, Siyi Liu, Yinuo Xu et al. · Princeton University · University of Pennsylvania +1 more

Introduces a spatiotemporally grounded benchmark and multimodal reward model for detecting human-perceived traces of AI-generated video fakeness

Output Integrity Attack vision multimodal generative
2 citations 2 influential PDF
benchmark arXiv Sep 3, 2025 · Sep 2025

SafeProtein: Red-Teaming Framework and Benchmark for Protein Foundation Models

Jigang Fan, Zhenghong Zhou, Ruofan Jin et al. · Peking University · Stanford University +3 more

Red-teams protein foundation models via multimodal prompt engineering and beam search, achieving 70% jailbreak success rate bypassing ESM3 safety filters

Prompt Injection nlp generative
PDF Code
attack arXiv Sep 2, 2025 · Sep 2025

Ensembling Membership Inference Attacks Against Tabular Generative Models

Joshua Ward, Yuxuan Yang, Chi-Hua Wang et al. · University of California Los Angeles · Stanford University

Ensembles multiple membership inference attacks against tabular synthetic data generators, achieving more robust privacy auditing than any single MIA strategy

Membership Inference Attack tabular generative
PDF Code
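Score-level ensembling of membership inference attacks is, generically, a matter of putting each attack's per-record scores on a common scale and aggregating them. The sketch below shows one such aggregation (rank-averaging), assuming each attack already emits a higher-is-more-likely-member score per record; the paper's actual ensembling strategy may differ.

```python
import numpy as np
from scipy.stats import rankdata

def ensemble_mia_scores(attack_scores: list) -> np.ndarray:
    """Combine per-record scores from several membership inference attacks.
    Each array holds one attack's scores over the same records (higher =
    more likely a training member). Scores are converted to ranks so attacks
    on different scales contribute equally, then averaged."""
    ranked = [rankdata(scores) for scores in attack_scores]
    return np.mean(ranked, axis=0)

# Hypothetical usage with three attacks' scores over the same audit set:
# combined = ensemble_mia_scores([attack_a_scores, attack_b_scores, attack_c_scores])
# Records with the highest combined rank are flagged as likely training members.
```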