Latest papers

16 papers
benchmark arXiv Mar 16, 2026 · 23d ago

How Vulnerable Are AI Agents to Indirect Prompt Injections? Insights from a Large-Scale Public Competition

Mateusz Dziemian, Maxwell Lin, Xiaohan Fu et al. · Gray Swan AI · OpenAI +6 more

Large-scale red teaming competition finds all frontier LLM agents vulnerable to concealed indirect prompt injection attacks with 0.5-8.5% success rates

Prompt Injection Excessive Agency nlpmultimodal
PDF
defense arXiv Jan 22, 2026 · 10w ago

Learning to Watermark in the Latent Space of Generative Models

Sylvestre-Alvise Rebuffi, Tuan Tran, Valeriu Lacatusu et al. · Meta · ETH Zürich

Embeds imperceptible watermarks in latent space of diffusion and autoregressive models, enabling 20x faster in-model content provenance tracking

Output Integrity Attack visiongenerative
PDF Code
defense arXiv Dec 23, 2025 · Dec 2025

Safety Alignment of LMs via Non-cooperative Games

Anselm Paulus, Ilia Kulikov, Brandon Amos et al. · Meta · University of Tübingen

Defends LLMs against jailbreaks by jointly training an Attacker and Defender LM as a non-cooperative RL game, shifting the safety-utility Pareto frontier

Prompt Injection nlp
1 citations PDF
benchmark arXiv Dec 18, 2025 · Dec 2025

How Good is Post-Hoc Watermarking With Language Model Rephrasing?

Pierre Fernandez, Tom Sander, Hady Elsahar et al. · Meta

Benchmarks post-hoc LLM text watermarking strategies across compute budgets, finding Gumbel-max and beam search dominate for prose but fail on code

Output Integrity Attack nlp
PDF
benchmark arXiv Nov 18, 2025 · Nov 2025

Observational Auditing of Label Privacy

Iden Kalemaj, Luca Melis, Maxime Boucher et al. · Meta

Observational DP auditing framework verifies label and attribute privacy without modifying training data, extending beyond membership inference

Model Inversion Attack Membership Inference Attack visiontabular
PDF
defense arXiv Nov 11, 2025 · Nov 2025

SALT: Steering Activations towards Leakage-free Thinking in Chain of Thought

Shourya Batra, Pierce Tillman, Samarth Gaggar et al. · Independent · Algoverse +3 more

Activation steering defense that reduces sensitive user data leakage in LLM chain-of-thought reasoning traces at inference time

Sensitive Information Disclosure nlp
4 citations 1 influentialPDF
tool arXiv Oct 27, 2025 · Oct 2025

PrivacyGuard: A Modular Framework for Privacy Auditing in Machine Learning

Luca Melis, Matthew Grange, Iden Kalemaj et al. · Meta

Open-source PyTorch tool for auditing ML model privacy via membership inference, reconstruction, and extraction attacks

Membership Inference Attack Model Inversion Attack visionnlpgenerative
PDF Code
attack arXiv Oct 23, 2025 · Oct 2025

Transferable Black-Box One-Shot Forging of Watermarks via Image Preference Models

Tomáš Souček, Sylvestre-Alvise Rebuffi, Pierre Fernandez et al. · Meta · ETH Zürich

Forges image content watermarks in one shot via a gradient-based preference model, requiring no watermarking model access

Output Integrity Attack vision
PDF Code
defense SSRN Oct 8, 2025 · Oct 2025

A2AS: Agentic AI Runtime Security and Self-Defense

Eugene Neelou, Ivan Novikov, Max Moroz et al. · A2AS · OWASP +10 more

Proposes A2AS runtime security framework for LLM agents enforcing prompt authentication, behavior boundaries, and in-context defenses

Prompt Injection Excessive Agency nlp
3 citations PDF
attack arXiv Oct 6, 2025 · Oct 2025

RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection

Yuxin Wen, Arman Zharmagambetov, Ivan Evtimov et al. · University of Maryland · Meta

Trains RL attacker from scratch to perform prompt injection, achieving 98% ASR against GPT-4o and bypassing Instruction Hierarchy and SecAlign defenses

Prompt Injection nlp
9 citations PDF Code
tool arXiv Oct 3, 2025 · Oct 2025

ARMs: Adaptive Red-Teaming Agent against Multimodal Models with Plug-and-Play Attacks

Zhaorun Chen, Xun Liu, Mintong Kang et al. · University of Chicago · University of Illinois +2 more

Adaptive agentic red-teaming system jailbreaks VLMs with 11 multimodal attack strategies, exceeding 90% ASR on Claude-4-Sonnet

Input Manipulation Attack Prompt Injection multimodalnlp
1 citations PDF Code
defense arXiv Oct 1, 2025 · Oct 2025

Large Reasoning Models Learn Better Alignment from Flawed Thinking

ShengYun Peng, Eric Smith, Ivan Evtimov et al. · Meta · Georgia Institute of Technology +1 more

Defends LLMs against chain-of-thought jailbreaks by RL-training models to self-correct injected flawed reasoning premises

Prompt Injection nlp
7 citations PDF
benchmark arXiv Oct 1, 2025 · Oct 2025

Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks

Shoumik Saha, Jifan Chen, Sam Mayers et al. · University of Maryland - College Park · AWS AI Labs +1 more

Benchmarks jailbreak attacks on code-capable LLM agents, showing agent wrapping raises attack success 1.6x with 32% instantly deployable malicious code

Prompt Injection Excessive Agency nlp
2 citations 1 influentialPDF
benchmark arXiv Sep 25, 2025 · Sep 2025

Learning the Wrong Lessons: Syntactic-Domain Spurious Correlations in Language Models

Chantal Shaib, Vinith M. Suriyakumar, Levent Sagun et al. · Northeastern University · MIT +1 more

Exploits learned syntactic-domain correlations to bypass LLM safety refusals via malformed or domain-mismatched prompts

Prompt Injection nlp
2 citations PDF
defense arXiv Sep 18, 2025 · Sep 2025

Adversarial Distilled Retrieval-Augmented Guarding Model for Online Malicious Intent Detection

Yihao Guo, Haocheng Bian, Liutong Zhou et al. · Apple · Cohere +3 more

Builds a compact 149M-parameter RAG-augmented guard model that detects malicious LLM prompts in real time with GPT-4-level accuracy

Prompt Injection nlp
PDF
benchmark arXiv Sep 10, 2025 · Sep 2025

Evaluation Awareness Scales Predictably in Open-Weights Large Language Models

Maheep Chaudhary, Ian Su, Nikhil Hooda et al. · Independent · University of California +6 more

Discovers power-law scaling of LLM evaluation awareness across 15 models, forecasting deceptive capability concealment in larger models

Prompt Injection nlp
PDF Code