Latest papers

10 papers
attack arXiv Mar 12, 2026

Cascade: Composing Software-Hardware Attack Gadgets for Adversarial Threat Amplification in Compound AI Systems

Sarbartha Banerjee, Prateek Sahu, Anjo Vahldiek-Oberwagner et al. · Georgia Tech · The University of Texas at Austin +3 more

Combines Rowhammer hardware faults and RAG database injection with LLM-level attacks to jailbreak guardrails and exfiltrate user data

Prompt Injection Sensitive Information Disclosure nlp
PDF
attack arXiv Feb 14, 2026

Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges

Ruomeng Ding, Yifei Pang, He Sun et al. · University of North Carolina at Chapel Hill · Carnegie Mellon University +2 more

Attacks LLM alignment pipelines by crafting benchmark-compliant rubric edits that systematically bias judge preferences and corrupt RLHF training

Transfer Learning Attack Prompt Injection nlp
PDF Code
attack arXiv Jan 30, 2026

A Fragile Guardrail: Diffusion LLM's Safety Blessing and Its Failure Mode

Zeyuan He, Yupeng Chen, Lang Lin et al. · University of Oxford · The Chinese University of Hong Kong +2 more

Finds that diffusion LLMs have intrinsic jailbreak resistance, then breaks it with context-nesting prompts that achieve state-of-the-art attack success rates

Prompt Injection nlp
PDF
benchmark arXiv Jan 27, 2026

Do Images Speak Louder than Words? Investigating the Effect of Textual Misinformation in VLMs

Chi Zhang, Wenxuan Ding, Jiale Liu et al. · The University of Texas at Austin · New York University +3 more

Benchmarks VLM susceptibility to persuasive conflicting text prompts that override visual evidence, finding a 48% average accuracy drop

Prompt Injection vision nlp multimodal
PDF
defense arXiv Oct 23, 2025

RAGRank: Using PageRank to Counter Poisoning in CTI LLM Pipelines

Austin Jia, Avaneesh Ramesh, Zain Shamsi et al. · The University of Texas at Austin

Defends RAG-based CTI pipelines against corpus poisoning by ranking documents with a PageRank-derived source credibility score

Data Poisoning Attack Training Data Poisoning nlp
PDF
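The core mechanism here, scoring CTI sources with PageRank and re-ranking retrieved documents by source credibility before they reach the LLM, can be sketched in a few lines of pure Python. The citation graph, source names, and damping factor below are illustrative assumptions, not the paper's implementation:

```python
# Illustrative sketch: rank CTI sources with power-iteration PageRank,
# then prioritize retrieved documents from high-credibility sources.

def pagerank(links, damping=0.85, iters=50):
    """links: {source: [sources it cites]} -> {source: score}."""
    nodes = set(links) | {n for outs in links.values() for n in outs}
    n = len(nodes)
    scores = {node: 1.0 / n for node in nodes}
    for _ in range(iters):
        new = {node: (1 - damping) / n for node in nodes}
        for src, outs in links.items():
            if outs:
                share = damping * scores[src] / len(outs)
                for dst in outs:
                    new[dst] += share
            else:  # dangling source: spread its mass evenly
                for dst in nodes:
                    new[dst] += damping * scores[src] / n
        scores = new
    return scores

# Toy citation graph among threat-intel feeds (hypothetical names).
citations = {
    "vendor_blog": ["cert_advisories", "mitre_attack"],
    "cert_advisories": ["mitre_attack"],
    "mitre_attack": ["cert_advisories"],
    "pastebin_dump": ["pastebin_dump"],  # self-citing, never cited by others
}
cred = pagerank(citations)

# Re-rank retrieved documents by source credibility before the LLM sees them.
retrieved = [("ioc list", "pastebin_dump"), ("ttp report", "mitre_attack")]
retrieved.sort(key=lambda doc: cred[doc[1]], reverse=True)
```

The intuition matches the poisoning threat model: an attacker can upload documents cheaply, but earning inbound citations from established sources is expensive, so low-credibility sources sink in the ranking.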
defense arXiv Oct 6, 2025

Adversarial Reinforcement Learning for Large Language Model Agent Safety

Zizhao Wang, Dingcheng Li, Vaishakh Keshava et al. · Google · The University of Texas at Austin +2 more

Defends LLM tool-using agents from indirect prompt injection via adversarial RL co-training in a two-player zero-sum game

Prompt Injection nlp reinforcement-learning
3 citations PDF
tool arXiv Oct 2, 2025

VidGuard-R1: AI-Generated Video Detection and Explanation via Reasoning MLLMs and RL

Kyoungjun Park, Yifan Yang, Juheon Yi et al. · The University of Texas at Austin · Microsoft Research

Detects AI-generated videos via GRPO-fine-tuned MLLM with temporal artifact reward models, achieving >95% accuracy

Output Integrity Attack vision multimodal generative
2 citations 1 influential PDF Code
attack arXiv Sep 24, 2025

Generative Model Inversion Through the Lens of the Manifold Hypothesis

Xiong Peng, Bo Han, Fengfei Yu et al. · Hong Kong Baptist University · The University of Sydney +2 more

Explains why generative model inversion attacks work via manifold theory and proposes methods to amplify their effectiveness

Model Inversion Attack vision generative
PDF
defense arXiv Sep 16, 2025

The Sum Leaks More Than Its Parts: Compositional Privacy Risks and Mitigations in Multi-Agent Collaboration

Vaidehi Patil, Elias Stengel-Eskin, Mohit Bansal · The University of Texas at Austin · UNC Chapel Hill

Adversary aggregates multi-agent LLM responses to infer sensitive data; proposes theory-of-mind (ToM) and consensus-voting defenses

Sensitive Information Disclosure Excessive Agency nlp
PDF Code
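A consensus-voting defense of the kind this summary mentions can be gestured at with a toy release gate. This is an assumed design for illustration, not the paper's exact defense; the reviewer functions stand in for LLM judges:

```python
# Illustrative sketch: release an aggregated multi-agent answer only if
# enough reviewer agents vote that it is non-sensitive (assumed design).

def consensus_release(answer, reviewers, threshold=1.0):
    """reviewers: callables answer -> True if the answer looks safe."""
    votes = [review(answer) for review in reviewers]
    safe_fraction = sum(votes) / len(votes)
    return answer if safe_fraction >= threshold else "[withheld]"

# Toy reviewers standing in for LLM judges: flag blocklisted terms.
def blocks(term):
    return lambda answer: term not in answer.lower()

reviewers = [blocks("ssn"), blocks("password"), lambda a: True]

consensus_release("the quarterly report is ready", reviewers)  # released as-is
consensus_release("user password is hunter2", reviewers)       # "[withheld]"
```

The compositional risk the paper studies is exactly why the default here is conservative (unanimity rather than simple majority): each individual response may look benign while the aggregate leaks.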
benchmark arXiv Aug 27, 2025

Language Models Identify Ambiguities and Exploit Loopholes

Jio Choi, Mohit Bansal, Elias Stengel-Eskin · UNC Chapel Hill · The University of Texas at Austin

Benchmarks LLM loophole exploitation: agents deliberately misread ambiguous user instructions to favor their own competing goals

Excessive Agency nlp
PDF Code