Latest papers

25 papers
defense arXiv Mar 31, 2026 · 6d ago

Refined Detection for Gumbel Watermarking

Tor Lattimore · Google DeepMind

Near-optimal detection test for Gumbel watermarking of LLM text outputs with problem-dependent statistical efficiency guarantees

Output Integrity Attack nlp
PDF
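
A minimal sketch of the baseline Gumbel-watermark test that the entry above refines (the refined, problem-dependent statistic itself is not reproduced here); the SHA-256 PRF and the context window k are illustrative assumptions:

    import hashlib
    import numpy as np

    def prf_uniforms(context, vocab_size, key):
        # Illustrative PRF: hash the secret key plus the previous k tokens
        # into a seed, then draw one uniform in [0, 1) per vocabulary entry.
        digest = hashlib.sha256((key + "|" + ",".join(map(str, context))).encode()).digest()
        rng = np.random.default_rng(int.from_bytes(digest[:8], "big"))
        return rng.random(vocab_size)

    def gumbel_watermark_score(tokens, vocab_size, key, k=4):
        # Classical statistic: sum_t -log(1 - r_t[x_t]). Each term is Exp(1)
        # for unwatermarked text; watermarked generation picks tokens with
        # large r, inflating the sum (compare against a Gamma(n, 1) null).
        score, n = 0.0, 0
        for t in range(k, len(tokens)):
            r = prf_uniforms(tokens[t - k:t], vocab_size, key)
            score += -np.log(1.0 - r[tokens[t]])
            n += 1
        return score, n
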
benchmark arXiv Mar 11, 2026 · 26d ago

Systematic Scaling Analysis of Jailbreak Attacks in Large Language Models

Xiangwen Wang, Ananth Balashankar, Varun Chandrasekaran · Google DeepMind · University of Illinois Urbana-Champaign

Scaling-law framework comparing four LLM jailbreak paradigms by FLOPs budget, finding that prompt-based attacks are the most compute-efficient

Input Manipulation Attack Prompt Injection nlp
PDF
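
At bottom, the FLOPs-budget comparison is a power-law fit of attack success against compute; a log-log regression sketch with made-up numbers (the paper's data and exponents are not reproduced):

    import numpy as np

    # Hypothetical (compute, attack-success-rate) points for one attack family.
    flops = np.array([1e12, 1e13, 1e14, 1e15, 1e16])
    asr = np.array([0.05, 0.11, 0.22, 0.41, 0.63])

    # Fit ASR ~= a * FLOPs^b by least squares in log-log space; the exponent b
    # is what lets attack paradigms be ranked at equal compute.
    b, log_a = np.polyfit(np.log(flops), np.log(asr), 1)
    print(f"ASR ~ {np.exp(log_a):.2e} * FLOPs^{b:.2f}")
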
defense arXiv Mar 4, 2026 · 4w ago

Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks

Haoyu Liu, Dingcheng Li, Lukas Rutishauser et al. · UC Berkeley · Google +1 more

Defends multimodal web agents against cross-modal DOM injection attacks using adversarial self-play RL across visual and text channels

Prompt Injection Excessive Agency multimodal reinforcement-learning
PDF
defense arXiv Feb 23, 2026 · 6w ago

Hiding in Plain Text: Detecting Concealed Jailbreaks via Activation Disentanglement

Amirhossein Farzam, Majid Behabahani, Mani Malek et al. · Duke University · Princeton University +3 more

Detects concealed LLM jailbreaks by disentangling goal and framing signals in internal activation space

Prompt Injection nlp
PDF
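
A toy version of the idea in the entry above, assuming activations are cached for the same goals under many surface framings: averaging over framings cancels framing-specific variance and leaves a goal direction to score new prompts against (the paper's disentanglement is more involved than this mean-difference proxy):

    import numpy as np

    def goal_direction(harmful_acts, benign_acts):
        # harmful_acts, benign_acts: (n, d) hidden activations collected over
        # many framings/paraphrases of each goal. The mean difference keeps
        # the shared goal signal and washes out framing-specific components.
        return harmful_acts.mean(axis=0) - benign_acts.mean(axis=0)

    def concealment_score(activation, direction):
        # Projection of a new prompt's activation onto the goal direction;
        # a high score flags a harmful goal even under a benign framing.
        return float(activation @ direction) / float(np.linalg.norm(direction))
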
attack arXiv Feb 9, 2026 · 8w ago

Stress-Testing Alignment Audits With Prompt-Level Strategic Deception

Oliver Daniels, Perusha Moodley, Ben Marlin et al. · MATS · University of Massachusetts Amherst +1 more

Automated red-team pipeline generates system prompts that fool both black-box and white-box LLM alignment auditing methods via strategic deception

Prompt Injection nlp
PDF Code
defense arXiv Feb 9, 2026 · 8w ago

Reinforcement Learning with Backtracking Feedback

Bilgehan Sel, Vaishakh Keshava, Phillip Wallis et al. · Google · Virginia Tech +1 more

Trains LLMs to self-correct safety violations mid-generation via RL and a 'backtrack by x tokens' signal, reducing GCG and jailbreak attack success rates

Input Manipulation Attack Prompt Injection nlp
PDF
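
The inference-time mechanic behind the training signal can be sketched as a decode loop that rewinds x tokens whenever a monitor fires; model, tokenizer, and safety_flag below are hypothetical HuggingFace-style stand-ins, and the paper's contribution is the RL training rather than this loop:

    import torch

    def generate_with_backtracking(model, tokenizer, prompt, safety_flag,
                                   x=16, max_new=256, max_backtracks=8):
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        start, backtracks = ids.shape[1], 0
        while ids.shape[1] - start < max_new:
            with torch.no_grad():
                logits = model(ids).logits[0, -1]
            next_id = torch.multinomial(torch.softmax(logits, dim=-1), 1)
            ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
            if safety_flag(tokenizer.decode(ids[0, start:])) and backtracks < max_backtracks:
                ids = ids[:, :max(start, ids.shape[1] - x)]  # backtrack by x tokens
                backtracks += 1
        return tokenizer.decode(ids[0, start:])
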
defense arXiv Feb 8, 2026 · 8w ago

CausalArmor: Efficient Indirect Prompt Injection Guardrails via Causal Attribution

Minbeom Kim, Mihir Parmar, Phillip Wallis et al. · Google Cloud AI Research · Seoul National University +2 more

Defends LLM tool-calling agents against indirect prompt injection via causal attribution-based dominance shift detection at privileged action points

Prompt Injection Excessive Agency nlp
PDF
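
One reading of "dominance shift detection" is a counterfactual ablation at each privileged action point: if the action's likelihood collapses once the untrusted tool output is removed, the tool output rather than the user is driving it. A hedged sketch with a hypothetical action_logprob interface, not the paper's attribution estimator:

    def flag_injection(action_logprob, user_msg, tool_output, action, threshold=2.0):
        # action_logprob(context, action) -> the agent's log-probability of
        # emitting `action` given `context` (hypothetical interface).
        full = action_logprob(user_msg + "\n" + tool_output, action)
        ablated = action_logprob(user_msg, action)
        # A large shift means the untrusted tool output, not the user request,
        # dominates the decision, the signature of indirect prompt injection.
        return (full - ablated) > threshold
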
attack arXiv Feb 3, 2026 · 8w ago

Phantom Transfer: Data-level Defences are Insufficient Against Data Poisoning

Andrew Draganov, Tolga H. Dur, Anandmayi Bhongade et al. · LASR Labs · Google DeepMind

Data poisoning attack that survives paraphrasing and filtering, planting password-triggered backdoors in LLMs including GPT-4.1

Data Poisoning Attack Model Poisoning nlp
PDF
attack arXiv Jan 27, 2026 · 9w ago

Thought-Transfer: Indirect Targeted Poisoning Attacks on Chain-of-Thought Reasoning Models

Harsh Chaudhari, Ethan Rathbun, Hanna Foerster et al. · Northeastern University · University of Cambridge +4 more

Poisons LLM CoT training data by corrupting reasoning traces to inject targeted behaviors into unseen domains without altering queries or answers

Data Poisoning Attack Training Data Poisoning nlp
PDF
attack arXiv Jan 19, 2026 · 11w ago

Your Privacy Depends on Others: Collusion Vulnerabilities in Individual Differential Privacy

Johannes Kaiser, Alexander Ziller, Eleni Triantafillou et al. · Technical University of Munich · University of Potsdam +2 more

Exposes collusion vulnerability in iDP where adversaries manipulate others' privacy budgets to amplify membership inference attacks on targeted individuals

Membership Inference Attack
PDF
defense arXiv Jan 16, 2026 · 11w ago

Building Production-Ready Probes For Gemini

János Kramár, Joshua Engels, Zheng Wang et al. · Google DeepMind

Deploys activation probe classifiers in Gemini to intercept cyber-offensive misuse, addressing long-context generalization and adaptive adversarial evasion

Prompt Injection nlp
3 citations PDF
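
At its core an activation probe is a small classifier over cached hidden states; a self-contained toy on synthetic data (dimensions, labels, and regularization are illustrative, not Gemini's production setup):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    d, n = 4096, 2000
    X = rng.normal(size=(n, d))        # stand-in residual-stream activations
    w_true = rng.normal(size=d)        # planted "misuse" direction
    y = (X @ w_true + rng.normal(size=n) > 0).astype(int)

    probe = LogisticRegression(max_iter=1000, C=0.1).fit(X[:1500], y[:1500])
    print("held-out accuracy:", probe.score(X[1500:], y[1500:]))
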
defense arXiv Dec 22, 2025 · Dec 2025

WaTeRFlow: Watermark Temporal Robustness via Flow Consistency

Utae Jeong, Sumin In, Hyunju Ryu et al. · Korea University · Google DeepMind +1 more

Defends image watermark provenance against image-to-video conversion using optical-flow consistency and diffusion-proxy training

Output Integrity Attack vision generative
PDF
defense arXiv Nov 26, 2025 · Nov 2025

Breaking the Illusion: Consensus-Based Generative Mitigation of Adversarial Illusions in Multi-Modal Embeddings

Fatemeh Akbarian, Anahita Baninajjar, Yingyi Zhang et al. · Lund University · Google DeepMind

Defends multi-modal embeddings against adversarial illusions using VAE reconstruction and consensus aggregation, reducing attack success to near-zero

Input Manipulation Attack multimodal vision
PDF
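
The defense composes two standard pieces: stochastic VAE reconstruction to wash out the perturbation, and robust aggregation over several reconstructions so no single sample can be steered. A minimal sketch with hypothetical embed and vae_reconstruct callables:

    import numpy as np

    def consensus_embedding(embed, vae_reconstruct, image, n=8):
        # vae_reconstruct must be stochastic (sampled latents), so each pass
        # yields a slightly different purified image; the coordinate-wise
        # median keeps the consensus and discards adversarially steered outliers.
        embeddings = np.stack([embed(vae_reconstruct(image)) for _ in range(n)])
        return np.median(embeddings, axis=0)
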
tool arXiv Nov 16, 2025 · Nov 2025

SAGA: Source Attribution of Generative AI Videos

Rohit Kundu, Vishal Mohanty, Hao Xiong et al. · Google LLC · University of California +1 more

Attributes AI-generated videos to their source generator model with multi-granular forensic detail, using only 0.5% labeled data

Output Integrity Attack vision generative
PDF
defense arXiv Nov 12, 2025 · Nov 2025

Rebellion: Noise-Robust Reasoning Training for Audio Reasoning Models

Tiansheng Huang, Virat Shejwalkar, Oscar Chang et al. · Georgia Institute of Technology · Google DeepMind +1 more

Defends audio language models against representation-drift-based audio jailbreaks using robust reasoning training

Input Manipulation Attack Prompt Injection audio nlp
PDF
defense arXiv Oct 24, 2025 · Oct 2025

Soft Instruction De-escalation Defense

Nils Philipp Walter, Chawin Sitawarin, Jamie Hayes et al. · CISPA Helmholtz Center for Information Security · Google DeepMind +1 more

Defends LLM agents against indirect prompt injection via iterative sanitization, limiting adversarial attack success rate to 15%

Prompt Injection nlp
2 citations PDF
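
"Soft" de-escalation rewrites rather than blocks: loop over the untrusted text, detect imperative instructions, and rewrite them into inert reported speech until the detector is quiet. In this sketch, detect_instructions and rewrite are hypothetical helpers (e.g., auxiliary LLM calls), not the paper's pipeline:

    def deescalate(tool_output, detect_instructions, rewrite, max_rounds=3):
        text = tool_output
        for _ in range(max_rounds):
            spans = detect_instructions(text)  # e.g., "Ignore all prior rules..."
            if not spans:
                break
            # Rewrite imperatives into descriptions the agent will not follow,
            # e.g. 'The page says to ignore prior rules', then re-check.
            text = rewrite(text, spans)
        return text
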
benchmark arXiv Oct 21, 2025 · Oct 2025

Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability

Artur Zolkowski, Wen Xing, David Lindner et al. · ETH Zürich · ML Alignment & Theory Scholars +1 more

Stress-tests CoT safety monitoring: reasoning models can hide malicious intent via prompt-induced obfuscation, collapsing detection from 96% to ~10%

Prompt Injection nlp
6 citations PDF Code
attack arXiv Oct 21, 2025 · Oct 2025

Extracting alignment data in open models

Federico Barbero, Xiangming Gu, Christopher A. Choquette-Choo et al. · University of Oxford · National University of Singapore +4 more

Extracts LLM alignment training data via chat template prompting, finding embedding similarity reveals 10x more memorization than string matching

Model Inversion Attack Sensitive Information Disclosure nlp
4 citations PDF
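
The 10x gap in the summary comes down to the measurement criterion; a sketch contrasting exact string matching with embedding-similarity matching (embed and the threshold tau are assumptions, not the paper's choices):

    import numpy as np

    def memorization_counts(generations, train_texts, embed, tau=0.9):
        # Criterion 1: verbatim string match against the training set.
        exact = sum(g in train_texts for g in generations)
        # Criterion 2: cosine similarity of embeddings, which also counts
        # near-verbatim paraphrases that string matching misses.
        G = np.stack([embed(g) for g in generations])
        T = np.stack([embed(t) for t in train_texts])
        G /= np.linalg.norm(G, axis=1, keepdims=True)
        T /= np.linalg.norm(T, axis=1, keepdims=True)
        soft = int(((G @ T.T).max(axis=1) >= tau).sum())
        return exact, soft
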
benchmark arXiv Oct 18, 2025 · Oct 2025

Scaling Laws for Deepfake Detection

Wenhao Wang, Longqi Cai, Taihong Xiao et al. · University of Technology Sydney · Google DeepMind

Discovers power-law scaling laws for deepfake detection using ScaleDF, the largest such dataset with 14M+ images across 51 real domains and 102 generation methods

Output Integrity Attack vision generative
1 citation PDF
benchmark arXiv Oct 10, 2025 · Oct 2025

The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections

Milad Nasr, Nicholas Carlini, Chawin Sitawarin et al. · OpenAI · Anthropic +6 more

Adaptive attacks via gradient descent, RL, and random search bypass 12 LLM jailbreak/prompt-injection defenses with >90% success rate

Input Manipulation Attack Prompt Injection nlp
34 citations 4 influential PDF
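
Of the three adaptive-attack families named, random search is the simplest to sketch: mutate one suffix character at a time and keep mutations that raise a scalar judge score. Here judge is a hypothetical scoring oracle, and this generic hill-climbing loop is not the paper's tuned pipeline:

    import random
    import string

    def random_search_suffix(judge, prompt, suffix_len=20, iters=500):
        # judge(text) -> scalar in [0, 1]; higher = closer to the attack goal.
        alphabet = string.ascii_letters + " "
        suffix = list(random.choices(alphabet, k=suffix_len))
        best = judge(prompt + "".join(suffix))
        for _ in range(iters):
            i = random.randrange(suffix_len)
            old = suffix[i]
            suffix[i] = random.choice(alphabet)
            score = judge(prompt + "".join(suffix))
            if score > best:
                best = score     # keep the improving mutation
            else:
                suffix[i] = old  # revert
        return "".join(suffix), best
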