Latest papers

34 papers
benchmark arXiv Apr 1, 2026 · 7d ago

Cooking Up Risks: Benchmarking and Reducing Food Safety Risks in Large Language Models

Weidi Luo, Xiaofei Wen, Tenghao Huang et al. · University of Georgia · University of California +3 more

Benchmark and guardrail for detecting jailbreak attacks that bypass LLM safety alignment in the food safety domain

Prompt Injection nlp
PDF Code
defense arXiv Mar 24, 2026 · 15d ago

Agent-Sentry: Bounding LLM Agents via Execution Provenance

Rohan Sequeira, Stavros Damianakis, Umar Iqbal et al. · University of Southern California · Washington University in St. Louis

Behavioral bounds framework that blocks malicious tool calls in LLM agents by learning execution patterns and detecting deviations

Prompt Injection Excessive Agency nlp
PDF
attack arXiv Mar 19, 2026 · 20d ago

The Autonomy Tax: Defense Training Breaks LLM Agents

Shawn Li, Yue Zhao · University of Southern California

Defense training against prompt injection destroys LLM agent tool-use competence, causing 99% timeout rates and 73-86% attack bypass

Prompt Injection Excessive Agency nlp
PDF
defense arXiv Mar 6, 2026 · 4w ago

Proof-of-Guardrail in AI Agents and What (Not) to Trust from It

Xisen Jin, Michael Duan, Qin Lin et al. · Sahara AI · University of Southern California

Proposes TEE-based cryptographic proof that AI agent responses passed a specific safety guardrail, preventing false safety claims

Output Integrity Attack Excessive Agency nlp
PDF Code
benchmark arXiv Feb 24, 2026 · 6w ago

AIForge-Doc: A Benchmark for Detecting AI-Forged Tampering in Financial and Form Documents

Jiaqi Wu, Yuchen Zhou, Muduo Xu et al. · Duke University · New York University +3 more

Benchmark revealing that all existing detectors fail to detect diffusion-model-inpainted forgeries in financial documents

Output Integrity Attack vision
1 citation PDF
defense arXiv Feb 16, 2026 · 7w ago

Differentially Private Retrieval-Augmented Generation

Tingting Tang, James Flemings, Yongqin Wang et al. · University of Southern California

Differentially private RAG algorithm that blocks adversarial extraction of sensitive documents from LLM knowledge bases via keyword-based DP output sanitization

Sensitive Information Disclosure nlp
PDF
benchmark arXiv Feb 10, 2026 · 8w ago

Benchmarking Knowledge-Extraction Attack and Defense on Retrieval-Augmented Generation

Zhisheng Qi, Utkarsh Sahu, Li Ma et al. · University of Oregon · Michigan State University +6 more

First systematic benchmark comparing knowledge-extraction attacks and defenses on RAG systems under unified evaluation protocols

Sensitive Information Disclosure nlp
PDF Code
attack arXiv Feb 6, 2026 · 8w ago

TrailBlazer: History-Guided Reinforcement Learning for Black-Box LLM Jailbreaking

Sung-Hoon Yoon, Ruizhi Qian, Minda Zhao et al. · Harvard University · Daegu Gyeongbuk Institute of Science and Technology +1 more

RL-based black-box jailbreak framework that reweights historical vulnerability signals to attack LLMs more efficiently

Prompt Injection nlp
PDF
attack arXiv Jan 30, 2026 · 9w ago

"Someone Hid It": Query-Agnostic Black-Box Attacks on LLM-Based Retrieval

Jiate Li, Defu Cao, Li Li et al. · University of Southern California · Adobe Research +1 more

Black-box query-agnostic adversarial token injection attack manipulates document rankings in RAG and LLM-based retrieval systems using surrogate LLMs

Input Manipulation Attack Prompt Injection nlp
1 citation PDF
attack arXiv Jan 30, 2026 · 9w ago

Fed-Listing: Federated Label Distribution Inference in Graph Neural Networks

Suprim Nakarmi, Junggab Son, Yue Zhao et al. · University of Nevada · University of Southern California

Gradient-based attack infers private label class proportions of federated GNN clients from shared gradients without accessing raw data

Model Inversion Attack graph federated-learning
PDF Code
attack arXiv Jan 29, 2026 · 9w ago

Hair-Trigger Alignment: Black-Box Evaluation Cannot Guarantee Post-Update Alignment

Yavuz Bakman, Duygu Nur Yaldiz, Salman Avestimehr et al. · University of Southern California

Proves static black-box alignment guarantees nothing post-update; constructs LLMs hiding latent jailbreak misalignment triggered by one benign gradient step

Model Poisoning Prompt Injection nlp
1 citation PDF
attack arXiv Jan 20, 2026 · 11w ago

SilentDrift: Exploiting Action Chunking for Stealthy Backdoor Attacks on Vision-Language-Action Models

Bingxin Xu, Yuzhang Shang, Binghui Wang et al. · University of Southern California · University of Central Florida +1 more

Backdoor attack on VLA robotic models exploiting action chunking to inject stealthy malicious trajectories with 93% ASR

Model Poisoning Data Poisoning Attack vision multimodal reinforcement-learning
1 citation PDF
attack arXiv Jan 18, 2026 · 11w ago

Multimodal Generative Engine Optimization: Rank Manipulation for Vision-Language Model Rankers

Yixuan Du, Chenxiao Yu, Haoyan Xu et al. · Georgetown University · University of Southern California +2 more

Jointly optimizes adversarial image perturbations and gradient-based text suffixes to manipulate VLM-based product search rankings

Input Manipulation Attack Prompt Injection vision nlp multimodal
PDF Code
benchmark arXiv Jan 12, 2026 · 12w ago

Defenses Against Prompt Attacks Learn Surface Heuristics

Shawn Li, Chenxiao Yu, Zhiyu Ni et al. · University of Southern California · University of California +3 more

Exposes three shortcut biases in LLM prompt-injection defenses (position, token-trigger, and topic generalization), causing up to 90% false rejection rates

Prompt Injection nlp
PDF Code
benchmark arXiv Dec 18, 2025 · Dec 2025

ContextLeak: Auditing Leakage in Private In-Context Learning Methods

Jacob Choi, Shuying Cao, Xingjian Dong et al. · University of Southern California

Canary-insertion auditing framework that measures worst-case information leakage from private in-context learning methods against DP guarantees

Membership Inference Attack Sensitive Information Disclosure nlp
3 citations 1 influential PDF
defense arXiv Dec 7, 2025 · Dec 2025

GSAE: Graph-Regularized Sparse Autoencoders for Robust LLM Safety Steering

Jehyeok Yeon, Federico Cinus, Yifan Wu et al. · University of Illinois Urbana-Champaign · University of Southern California +1 more

Proposes graph-regularized sparse autoencoders to capture distributed LLM safety representations for adaptive jailbreak defense with 82% refusal rate

Prompt Injection nlp
1 citation PDF
benchmark arXiv Dec 4, 2025 · Dec 2025

Topology Matters: Measuring Memory Leakage in Multi-Agent LLMs

Jinbo Liu, Defu Cao, Yifei Wei et al. · University of Southern California · Florida State University +1 more

Benchmarks PII leakage in multi-agent LLM systems across six topologies, showing dense connectivity and proximity amplify adversarial memory extraction

Sensitive Information Disclosure nlp
1 citation 1 influential PDF
defense arXiv Nov 18, 2025 · Nov 2025

From Narrow Unlearning to Emergent Misalignment: Causes, Consequences, and Containment in LLMs

Erum Mushtaq, Anil Ramakrishna, Satyapriya Krishna et al. · University of Southern California · Amazon AGI

Reveals that narrow refusal unlearning on LLMs triggers emergent misalignment in unrelated safety domains, and proposes a retain-data defense to contain it

Transfer Learning Attack Prompt Injection nlp
3 citations PDF
attack arXiv Nov 16, 2025 · Nov 2025

Whose Narrative is it Anyway? A KV Cache Manipulation Attack

Mukkesh Ganesh, Kaushik Iyer, Arun Baalaaji Sankar Ananthan · University of Southern California

Hijacks LLM conversation narratives mid-generation by overwriting KV cache segments with precomputed cache from an unrelated topic

Output Integrity Attack Prompt Injection nlp
PDF
attack arXiv Nov 14, 2025 · Nov 2025

A Systematic Study of Model Extraction Attacks on Graph Foundation Models

Haoyan Xu, Ruizhi Qian, Jiate Li et al. · University of Southern California · Florida State University +2 more

Systematically extracts Graph Foundation Models via black-box embedding regression, cloning victim models at 0.07% of original training cost

Model Theft graph multimodal
PDF