Latest papers

16 papers
benchmark arXiv Apr 20, 2026 · 4w ago

Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition

Prasoon Goyal, Sattvik Sahai, Michael Johnston et al. · Amazon

Crowdsourced adversarial data generation framework where attackers create prompts and defenders respond, producing diverse safety alignment datasets

Prompt Injection nlp
PDF
attack arXiv Apr 10, 2026 · 5w ago

ADAM: A Systematic Data Extraction Attack on Agent Memory via Adaptive Querying

Xingyu Lyu, Jianfeng He, Ning Wang et al. · University of Massachusetts Lowell · Virginia Tech +5 more

Adaptive query-based attack extracting private data from LLM agent memory, achieving 100% success via entropy-guided distribution estimation

Model Inversion Attack Sensitive Information Disclosure nlp
PDF
defense arXiv Apr 10, 2026 · 5w ago

Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

Jinqi Luo, Jinyu Yang, Tal Neiman et al. · University of Pennsylvania · Amazon +1 more

Activation steering defense using sparse autoencoders and concept dictionaries to safeguard multimodal LLMs against jailbreaks

Prompt Injection nlp vision multimodal
PDF
defense arXiv Apr 6, 2026 · 6w ago

Gradient-Controlled Decoding: A Safety Guardrail for LLMs with Dual-Anchor Steering

Purva Chiniya, Kevin Scaria, Sagar Chaturvedi · Amazon

Dual-anchor gradient detection combined with deterministic refusal-token injection to prevent LLM jailbreaks while reducing false positives by 52%

Prompt Injection nlp
PDF Code
benchmark arXiv Feb 25, 2026 · 12w ago

Manifold of Failure: Behavioral Attraction Basins in Language Models

Sarthak Munshi, Manish Bhatt, Vineeth Sai Narajala et al. · Amazon · Cisco +2 more

Maps LLM safety failure topology using quality-diversity optimization to reveal behavioral attraction basins across three frontier models

Prompt Injection nlp
PDF Code
benchmark arXiv Jan 12, 2026 · Jan 2026

Defenses Against Prompt Attacks Learn Surface Heuristics

Shawn Li, Chenxiao Yu, Zhiyu Ni et al. · University of Southern California · University of California +3 more

Exposes three shortcut biases in LLM prompt-injection defenses (position, token-trigger, and topic generalization) that cause up to 90% false rejection rates

Prompt Injection nlp
PDF Code
attack arXiv Jan 6, 2026 · Jan 2026

Multi-Turn Jailbreaking of Aligned LLMs via Lexical Anchor Tree Search

Devang Kulshreshtha, Hang Su, Chinmay Hegde et al. · Amazon · New York University +1 more

Attacker-LLM-free multi-turn jailbreak via lexical anchor injection achieves 97-100% attack success rate (ASR) on GPT/Claude/Llama in ~6.4 queries

Prompt Injection nlp
PDF
benchmark arXiv Dec 31, 2025 · Dec 2025

Large Empirical Case Study: Go-Explore adapted for AI Red Team Testing

Manish Bhatt, Adrian Wood, Idan Habler et al. · OWASP · Amazon +3 more

Adapts Go-Explore to red-team LLM tool-using agents, finding seed variance (8x spread) dominates algorithmic choice in prompt injection discovery

Prompt Injection Excessive Agency Red-Team Agents Benchmarks & Evaluation nlp
PDF Code
attack arXiv Dec 9, 2025 · Dec 2025

MIRAGE: Misleading Retrieval-Augmented Generation via Black-box and Query-agnostic Poisoning Attacks

Tailun Chen, Yu He, Yan Wang et al. · Zhejiang University · Alibaba Group +1 more

Black-box RAG corpus poisoning attack using persona-driven query synthesis, semantic anchoring, and adversarial preference optimization to mislead LLMs

Data Poisoning Attack Prompt Injection nlp
PDF
defense arXiv Oct 24, 2025 · Oct 2025

Adversarial Déjà Vu: Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks

Mahavir Dabas, Tran Huynh, Nikhil Reddy Billa et al. · Virginia Tech · Princeton University +1 more

Defends LLMs against novel jailbreaks by training on diverse compositions of adversarial skill primitives extracted from 32 prior attacks

Prompt Injection nlp
1 citation PDF
defense arXiv Oct 19, 2025 · Oct 2025

SafeSearch: Do Not Trade Safety for Utility in LLM Search Agents

Qiusi Zhan, Angeline Budiman-Chan, Abdelrahman Zayed et al. · University of Illinois Urbana-Champaign · Amazon

RL alignment for LLM search agents that cuts harmful outputs by 70% or more via query-level reward shaping without sacrificing QA utility

Prompt Injection Excessive Agency nlp reinforcement-learning
2 citations PDF Code
defense arXiv Oct 17, 2025 · Oct 2025

Distractor Injection Attacks on Large Reasoning Models: Characterization and Defense

Zhehao Zhang, Weijie Xu, Shixian Cui et al. · Amazon

Identifies reasoning distraction attacks on LRMs, where injected prompt distractors cut accuracy by 60%, and proposes an SFT+DPO defense

Prompt Injection nlp
PDF
benchmark arXiv Oct 4, 2025 · Oct 2025

How Catastrophic is Your LLM? Certifying Risk in Conversation

Chengxiao Wang, Isha Chaudhary, Qian Hu et al. · University of Illinois · Amazon

Statistical framework certifies catastrophic LLM response risk in multi-turn conversations via Markov sampling, finding up to 70% certified risk in frontier models

Prompt Injection nlp
1 citation PDF
tool EMNLP Sep 24, 2025 · Sep 2025

Unmasking Fake Careers: Detecting Machine-Generated Career Trajectories via Multi-layer Heterogeneous Graphs

Michiharu Yamashita, Thanh Tran, Delvin Ce Zhang et al. · The Pennsylvania State University · Amazon +1 more

Novel graph-based detection system for LLM-generated fake resume trajectories, outperforming text-based detectors by up to 85%

Output Integrity Attack nlp graph
3 citations PDF Code
benchmark arXiv Aug 13, 2025 · Aug 2025

Amazon Nova AI Challenge -- Trusted AI: Advancing secure, AI-assisted software development

Sattvik Sahai, Prasoon Goyal, Michael Johnston et al. · Amazon

Competition framework pitting automated jailbreak bots against safe LLM coding assistants in multi-turn adversarial tournaments

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp
PDF
defense arXiv Aug 4, 2025 · Aug 2025

TRACEALIGN -- Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs

Amitava Das, Vinija Jain, Aman Chadha · BITS Pilani · Meta AI +1 more

Traces LLM alignment failures to training corpus sources and defends against jailbreaks via inference filters, DPO regularization, and provenance-aware decoding

Prompt Injection nlp
PDF Code