Latest papers

12 papers
benchmark arXiv Feb 25, 2026

Manifold of Failure: Behavioral Attraction Basins in Language Models

Sarthak Munshi, Manish Bhatt, Vineeth Sai Narajala et al. · Amazon · Cisco +2 more

Maps LLM safety failure topology using quality-diversity optimization to reveal behavioral attraction basins across three frontier models

Prompt Injection nlp
PDF Code
benchmark arXiv Jan 12, 2026

Defenses Against Prompt Attacks Learn Surface Heuristics

Shawn Li, Chenxiao Yu, Zhiyu Ni et al. · University of Southern California · University of California +3 more

Exposes three shortcut biases in LLM prompt-injection defenses (position, token-trigger, and topic generalization) that cause up to 90% false rejection rates

Prompt Injection nlp
PDF Code
attack arXiv Jan 6, 2026

Multi-Turn Jailbreaking of Aligned LLMs via Lexical Anchor Tree Search

Devang Kulshreshtha, Hang Su, Chinmay Hegde et al. · Amazon · New York University +1 more

Attacker-LLM-free multi-turn jailbreak via lexical anchor injection that achieves a 97-100% attack success rate (ASR) on GPT/Claude/Llama in ~6.4 queries

Prompt Injection nlp
PDF
benchmark arXiv Dec 31, 2025

Large Empirical Case Study: Go-Explore adapted for AI Red Team Testing

Manish Bhatt, Adrian Wood, Idan Habler et al. · OWASP · Amazon +3 more

Adapts Go-Explore to red-team tool-using LLM agents, finding that seed variance (8x spread) dominates algorithmic choice in prompt-injection discovery

Prompt Injection Excessive Agency nlp
PDF Code
attack arXiv Dec 9, 2025

MIRAGE: Misleading Retrieval-Augmented Generation via Black-box and Query-agnostic Poisoning Attacks

Tailun Chen, Yu He, Yan Wang et al. · Zhejiang University · Alibaba Group +1 more

Black-box RAG corpus poisoning attack using persona-driven query synthesis, semantic anchoring, and adversarial preference optimization to mislead LLMs

Data Poisoning Attack Prompt Injection nlp
PDF
defense arXiv Oct 24, 2025

Adversarial Déjà Vu: Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks

Mahavir Dabas, Tran Huynh, Nikhil Reddy Billa et al. · Virginia Tech · Princeton University +1 more

Defends LLMs against novel jailbreaks by training on diverse compositions of adversarial skill primitives extracted from 32 prior attacks

Prompt Injection nlp
1 citation PDF
defense arXiv Oct 19, 2025

SafeSearch: Do Not Trade Safety for Utility in LLM Search Agents

Qiusi Zhan, Angeline Budiman-Chan, Abdelrahman Zayed et al. · University of Illinois Urbana-Champaign · Amazon

RL alignment for LLM search agents that cuts harmful outputs by over 70% via query-level reward shaping, without sacrificing QA utility

Prompt Injection Excessive Agency nlp reinforcement-learning
2 citations PDF Code
defense arXiv Oct 17, 2025

Distractor Injection Attacks on Large Reasoning Models: Characterization and Defense

Zhehao Zhang, Weijie Xu, Shixian Cui et al. · Amazon

Identifies reasoning-distraction attacks on large reasoning models, where injected prompt distractors cut accuracy by 60%, and proposes an SFT+DPO defense

Prompt Injection nlp
PDF
benchmark arXiv Oct 4, 2025

How Catastrophic is Your LLM? Certifying Risk in Conversation

Chengxiao Wang, Isha Chaudhary, Qian Hu et al. · University of Illinois · Amazon

Statistical framework certifies catastrophic LLM response risk in multi-turn conversations via Markov sampling, finding up to 70% certified risk in frontier models

Prompt Injection nlp
1 citation PDF
tool EMNLP Sep 24, 2025

Unmasking Fake Careers: Detecting Machine-Generated Career Trajectories via Multi-layer Heterogeneous Graphs

Michiharu Yamashita, Thanh Tran, Delvin Ce Zhang et al. · The Pennsylvania State University · Amazon +1 more

Novel graph-based detection system for LLM-generated fake resume trajectories, outperforming text-based detectors by up to 85%

Output Integrity Attack nlp graph
3 citations PDF Code
benchmark arXiv Aug 13, 2025

Amazon Nova AI Challenge -- Trusted AI: Advancing secure, AI-assisted software development

Sattvik Sahai, Prasoon Goyal, Michael Johnston et al. · Amazon

Competition framework pitting automated jailbreak bots against safe LLM coding assistants in multi-turn adversarial tournaments

Prompt Injection nlp
PDF
defense arXiv Aug 4, 2025

TRACEALIGN -- Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs

Amitava Das, Vinija Jain, Aman Chadha · BITS Pilani · Meta AI +1 more

Traces LLM alignment failures to training corpus sources and defends against jailbreaks via inference filters, DPO regularization, and provenance-aware decoding

Prompt Injection nlp
PDF Code