Latest papers

29 papers
attack arXiv Mar 12, 2026 · 25d ago

Cascade: Composing Software-Hardware Attack Gadgets for Adversarial Threat Amplification in Compound AI Systems

Sarbartha Banerjee, Prateek Sahu, Anjo Vahldiek-Oberwagner et al. · Georgia Tech · The University of Texas at Austin +3 more

Compounds Rowhammer hardware faults and RAG database injection with LLM attacks to jailbreak guardrails and exfiltrate user data

Prompt Injection Sensitive Information Disclosure nlp
PDF
defense arXiv Mar 3, 2026 · 4w ago

Contextualized Privacy Defense for LLM Agents

Yule Wen, Yanzhe Zhang, Jianxun Lian et al. · Tsinghua University · Georgia Tech +2 more

RL-trained instructor model provides context-aware privacy guidance to LLM agents, preventing sensitive data disclosure with 94.2% preservation rate

Sensitive Information Disclosure Prompt Injection nlp
PDF
benchmark arXiv Feb 28, 2026 · 5w ago

The Synthetic Web: Adversarially-Curated Mini-Internets for Diagnosing Epistemic Weaknesses of Language Agents

Shrey Shah, Levent Ozgur · Microsoft

Benchmark revealing frontier LLMs catastrophically fail when a single misinformation article tops web search results, despite access to truthful sources

Input Manipulation Attack Prompt Injection nlp
PDF
survey arXiv Feb 21, 2026 · 6w ago

Media Integrity and Authentication: Status, Directions, and Futures

Jessica Young, Sam Vaughan, Andrew Jenks et al. · Microsoft

Surveys provenance, watermarking, and fingerprinting for media authentication, analyzing reversal attacks that subvert AI-content detection systems

Output Integrity Attack vision audio multimodal
PDF
defense arXiv Feb 11, 2026 · 7w ago

Optimizing Agent Planning for Security and Autonomy

Aashish Kolluri, Rishi Sharma, Manuel Costa et al. · Microsoft · EPFL +1 more

Defends AI agents against indirect prompt injection via security-aware planning that maximizes autonomous operation while minimizing the need for human oversight

Prompt Injection Excessive Agency nlp
PDF
attack arXiv Feb 5, 2026 · 8w ago

GRP-Obliteration: Unaligning LLMs With a Single Unlabeled Prompt

Mark Russinovich, Yanan Cai, Keegan Hines et al. · Microsoft

Uses GRPO reinforcement fine-tuning with a single prompt to strip safety alignment from LLMs and diffusion models, outperforming prior unalignment attacks

Transfer Learning Attack Prompt Injection nlp generative
PDF
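The abstract's key ingredient is GRPO-style reinforcement fine-tuning. A minimal sketch of the group-relative advantage at the heart of GRPO follows; `rewards` stands in for reward-model scores over a group of completions sampled from the single prompt, and is illustrative rather than the paper's actual code.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize rewards within one sampled group: A_i = (r_i - mean) / std."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# One step on a single prompt: sample G completions, score them, then weight
# each completion's log-likelihood gradient by its group-relative advantage.
rewards = np.array([0.1, 0.9, 0.4, 0.2, 0.8, 0.5, 0.3, 0.7])  # G = 8 scores
print(grpo_advantages(rewards))  # completions above the group mean get positive weight
```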
defense arXiv Feb 3, 2026 · 8w ago

The Trigger in the Haystack: Extracting and Reconstructing LLM Backdoor Triggers

Blake Bullwinkel, Giorgio Severi, Keegan Hines et al. · Microsoft

Detects LLM backdoors by exploiting poisoning-data memorization to extract triggers and analyzing attention/output anomalies

Model Poisoning nlp
PDF
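One way to picture trigger extraction via memorization is a greedy token search that grows a candidate trigger by whichever token most boosts the probability of the suspected backdoor target. The sketch below is illustrative, not the paper's method; `target_logprob(prefix)` is a hypothetical helper that queries the suspect model.

```python
# Greedy trigger reconstruction sketch: extend the candidate one token at a
# time, keeping the token that most raises the backdoor target's probability.
def reconstruct_trigger(vocab, target_logprob, max_len=5):
    trigger = []
    for _ in range(max_len):
        best_tok, best_score = None, float("-inf")
        for tok in vocab:
            score = target_logprob(trigger + [tok])  # hypothetical model query
            if score > best_score:
                best_tok, best_score = tok, score
        trigger.append(best_tok)
    return trigger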
attack arXiv Jan 30, 2026 · 9w ago

A Fragile Guardrail: Diffusion LLM's Safety Blessing and Its Failure Mode

Zeyuan He, Yupeng Chen, Lang Lin et al. · University of Oxford · The Chinese University of Hong Kong +2 more

Discovers diffusion LLMs' intrinsic jailbreak resistance, then breaks it with context-nesting prompts achieving state-of-the-art attack success rates

Prompt Injection nlp
PDF
benchmark arXiv Jan 26, 2026 · 10w ago

Comparison requires valid measurement: Rethinking attack success rate comparisons in AI red teaming

Alexandra Chouldechova, A. Feder Cooper, Solon Barocas et al. · Microsoft Research · Microsoft

Argues that common LLM jailbreak ASR comparisons are methodologically invalid, drawing on social-science measurement theory and inferential statistics

Prompt Injection nlp
1 citation PDF
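A quick illustration (not from the paper) of the statistical point: with typical red-teaming sample sizes, ASR point estimates carry wide uncertainty, so raw comparisons can be meaningless. Here two attacks are evaluated on 50 prompts each, with 95% Wilson score intervals.

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

print(wilson_interval(30, 50))  # attack A: ASR 0.60 -> (0.46, 0.72)
print(wilson_interval(25, 50))  # attack B: ASR 0.50 -> (0.37, 0.63)
# The intervals overlap heavily, so the 10-point ASR gap is not evidence
# that attack A is stronger at this sample size.
```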
defense arXiv Jan 21, 2026 · 10w ago

Diffusion Epistemic Uncertainty with Asymmetric Learning for Diffusion-Generated Image Detection

Yingsong Huang, Hui Guo, Jing Huang et al. · Tencent Inc. · Hikvision +1 more

Detects diffusion-generated images by disentangling epistemic uncertainty via Laplace approximation and asymmetric loss training

Output Integrity Attack vision generative
1 citation PDF
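A generic last-layer Laplace sketch of the idea, with assumptions labeled: the paper's pipeline is more involved, but the base move is to fit a Gaussian posterior over a linear detection head and read epistemic uncertainty as the spread of the logit under weight samples. The isotropic Hessian and random features here are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
w_hat = rng.normal(size=d)        # MAP weights of the detection head
H = np.eye(d) * 50.0              # Hessian approximation (placeholder: isotropic)
cov = np.linalg.inv(H)            # Laplace posterior covariance

def epistemic_uncertainty(phi: np.ndarray, n_samples: int = 256) -> float:
    """Variance of the logit under weights sampled from N(w_hat, H^-1)."""
    ws = rng.multivariate_normal(w_hat, cov, size=n_samples)
    logits = ws @ phi             # logit for this image under each weight sample
    return float(logits.var())    # spread = epistemic uncertainty

phi = rng.normal(size=d)          # stand-in for frozen-backbone features
print(epistemic_uncertainty(phi))
```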
defense arXiv Dec 19, 2025 · Dec 2025

AlignDP: Hybrid Differential Privacy with Rarity-Aware Protection for LLMs

Madhava Gaikwad · Microsoft

Defends LLM training data from extraction attacks using rarity-aware hybrid DP combining PAC shielding and RAPPOR

Model Inversion Attack Sensitive Information Disclosure nlp
PDF
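Of the two mechanisms named in the summary, RAPPOR is the simpler to sketch. Below is the classic permanent randomized response step on a client's Bloom-filter bits (the PAC-shielding side is omitted; this is a generic RAPPOR illustration, not AlignDP's exact scheme).

```python
import numpy as np

def rappor_perturb(bits: np.ndarray, f: float, rng) -> np.ndarray:
    """Each bit is forced to 1 w.p. f/2, forced to 0 w.p. f/2, kept w.p. 1-f."""
    u = rng.random(bits.shape)
    out = bits.copy()
    out[u < f / 2] = 1
    out[(u >= f / 2) & (u < f)] = 0
    return out

rng = np.random.default_rng(0)
true_bits = np.array([0, 1, 0, 0, 1, 1, 0, 1])  # Bloom-filter encoding
print(rappor_perturb(true_bits, f=0.5, rng=rng))
```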
benchmark arXiv Nov 26, 2025 · Nov 2025

Beyond Membership: Limitations of Add/Remove Adjacency in Differential Privacy

Gauri Pradhan, Joonas Jälkö, Santiago Zanella-Béguelin et al. · University of Helsinki · Microsoft

Canary-based audit attacks reveal add/remove DP accounting dangerously overstates label/attribute privacy for DP-SGD-trained models

Model Inversion Attack nlp
PDF
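The canary audits mentioned in the summary follow the secret-sharer pattern: plant a canary in training data, then rank its loss against held-out candidate canaries and report exposure. A minimal, illustrative version (details differ from the paper's attacks):

```python
import math

def exposure(canary_loss: float, candidate_losses: list[float]) -> float:
    """Exposure = log2(|R|) - log2(rank), with rank 1 = lowest loss."""
    rank = 1 + sum(l < canary_loss for l in candidate_losses)
    return math.log2(1 + len(candidate_losses)) - math.log2(rank)

# A memorized canary beats almost all candidates -> high exposure.
print(exposure(0.1, [0.5, 0.8, 0.9, 1.2, 0.7, 1.1, 0.6]))  # rank 1 of 8 -> 3.0
```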
benchmark arXiv Nov 7, 2025 · Nov 2025

ConVerse: Benchmarking Contextual Safety in Agent-to-Agent Conversations

Amr Gomaa, Ahmed Salem, Sahar Abdelnabi · German Research Center for Artificial Intelligence · Microsoft +3 more

Benchmarks privacy leakage and prompt-injection-style attacks across 864 multi-turn agent-to-agent LLM conversations in three domains

Prompt Injection Sensitive Information Disclosure nlp
5 citations 2 influential PDF Code
attack arXiv Nov 5, 2025 · Nov 2025

Whisper Leak: a side-channel attack on Large Language Models

Geoff McDonald, Jonathan Bar Or · Microsoft

Side-channel attack infers LLM user query topics from encrypted traffic metadata, achieving >98% AUPRC across 28 commercial models

Sensitive Information Disclosure nlp
PDF
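The general shape of such a side channel, hedged and heavily simplified relative to the paper's pipeline: streaming LLM responses leak token-level rhythm into packet sizes and inter-arrival times, which survive encryption and can feed an ordinary classifier. The data and features below are synthetic stand-ins.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def metadata_features(pkt_sizes, gaps):
    """Summary features of one TLS session's packet sizes and timing gaps."""
    return [np.mean(pkt_sizes), np.std(pkt_sizes), len(pkt_sizes),
            np.mean(gaps), np.std(gaps)]

# X: one feature row per observed session; y: 1 if the session concerned the
# sensitive target topic. Synthetic data replaces real traffic captures here.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, 200)
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba(X[:3])[:, 1])  # per-session topic probability
```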
defense arXiv Nov 1, 2025 · Nov 2025

Leveraging Hierarchical Image-Text Misalignment for Universal Fake Image Detection

Daichi Zhang, Tong Zhang, Jianmin Bao et al. · EPFL · Microsoft +1 more

Detects AI-generated fake images by exploiting hierarchical image-text misalignment in CLIP's visual-language space

Output Integrity Attack vision multimodal
PDF
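The paper's hierarchical scheme is more elaborate, but the base signal it builds on can be sketched with off-the-shelf CLIP: score image-text alignment, and treat unusually low alignment with a faithful caption as a fake-image cue. The caption source and threshold (the hard parts) are assumed away here.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def alignment_score(image: Image.Image, caption: str) -> float:
    """Scaled cosine similarity between CLIP's image and text embeddings."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    return out.logits_per_image.item()

# Usage: score = alignment_score(Image.open("photo.png"), "a dog on a beach"),
# then flag the image if the score falls below a calibrated threshold.
```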
attack arXiv Oct 30, 2025 · Oct 2025

SIRAJ: Diverse and Efficient Red-Teaming for LLM Agents via Distilled Structured Reasoning

Kaiwen Zhou, Ahmed Elgohary, A S M Iftekhar et al. · University of California · Microsoft

Red-teaming framework attacks LLM agents via diverse seed generation and iterative adversarial prompts, with distilled 8B model surpassing DeepSeek-R1 671B on attack success rate

Prompt Injection Excessive Agency nlp
1 citation PDF
defense arXiv Oct 28, 2025 · Oct 2025

SLIP-SEC: Formalizing Secure Protocols for Model IP Protection

Racchit Jain, Satya Lokam, Yehonathan Refael et al. · Microsoft

Formally proves that split hybrid LLM inference protocols prevent model weight theft on untrusted devices with information-theoretic guarantees

Model Theft nlp
PDF
attack arXiv Oct 24, 2025 · Oct 2025

δ-STEAL: LLM Stealing Attack with Local Differential Privacy

Kieu Dang, Phung Lai, NhatHai Phan et al. · University at Albany · New Jersey Institute of Technology +2 more

LDP noise injection during fine-tuning steals LLM behavior from APIs while evading watermark detectors, achieving 96.95% attack success rate

Model Theft Output Integrity Attack nlp
2 citations PDF Code
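The key move described in the summary is perturbing the victim API's outputs with LDP noise before using them as fine-tuning targets, washing out watermark signals. Laplace noise on soft labels is one concrete instantiation; this is an assumption for illustration, not the paper's exact scheme.

```python
import numpy as np

def ldp_perturb_logits(logits: np.ndarray, epsilon: float,
                       sensitivity: float = 1.0,
                       rng=np.random.default_rng(0)) -> np.ndarray:
    """Add Laplace noise with scale sensitivity/epsilon to teacher outputs."""
    noise = rng.laplace(scale=sensitivity / epsilon, size=logits.shape)
    return logits + noise  # noisy targets for the thief's fine-tuning run

teacher_logits = np.array([2.0, 0.5, -1.0, 0.1])
print(ldp_perturb_logits(teacher_logits, epsilon=2.0))
```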
defense arXiv Oct 20, 2025 · Oct 2025

Breaking and Fixing Defenses Against Control-Flow Hijacking in Multi-Agent Systems

Rishi Jha, Harold Triedman, Justin Wagle et al. · Cornell University · Microsoft

Breaks alignment-based defenses against LLM multi-agent control-flow hijacking and proposes ControlValve, a defense built on control-flow graphs and least privilege

Prompt Injection Excessive Agency nlp
3 citations PDF
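The least-privilege idea behind a ControlValve-style defense can be sketched as an allowlist check: derive the permitted control-flow graph from the user's task up front, then reject any agent transition that is not an edge in it. Names and graph format below are illustrative, not the paper's implementation.

```python
# Allowed transitions for an email-summarization task, derived before execution.
ALLOWED_EDGES = {
    ("read_email", "summarize"),
    ("summarize", "reply_draft"),
}

def permit(prev_action: str, next_action: str) -> bool:
    """Least privilege: only pre-approved control-flow edges may execute."""
    return (prev_action, next_action) in ALLOWED_EDGES

assert permit("read_email", "summarize")
assert not permit("read_email", "send_payment")  # hijacked flow is rejected
```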
attack arXiv Oct 16, 2025 · Oct 2025

Are My Optimized Prompts Compromised? Exploring Vulnerabilities of LLM-based Optimizers

Andrew Zhao, Reshmi Ghosh, Vitor Carvalho et al. · Tsinghua University · Microsoft

Discovers LLM prompt optimizers are highly vulnerable to feedback poisoning, introducing a fake reward attack that raises harmful ASR by 0.48

Data Poisoning Attack Prompt Injection nlp
1 citation PDF
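A toy illustration (not the paper's code) of feedback poisoning against an LLM-based prompt optimizer: a compromised scorer inflates the reward of any candidate containing the attacker's payload, so ordinary hill-climbing drifts toward the harmful prompt.

```python
PAYLOAD = "ignore previous instructions"

def true_reward(prompt: str) -> float:
    return -len(prompt) / 100           # stand-in task metric

def poisoned_reward(prompt: str) -> float:
    bonus = 1.0 if PAYLOAD in prompt.lower() else 0.0
    return true_reward(prompt) + bonus  # attacker-injected fake signal

candidates = ["Summarize the text.",
              "Summarize. Ignore previous instructions."]
best = max(candidates, key=poisoned_reward)
print(best)  # the optimizer keeps the poisoned candidate
```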