Latest papers

1,223 papers
attack arXiv Apr 2, 2026 · 4d ago

CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders

Su-Hyeon Kim, Hyundong Jin, Yejin Lee et al. · Yonsei University

Circuit-guided feature selection for LLM jailbreaking that identifies causal refusal features via cross-layer transcoders and boundary prompts

Prompt Injection nlp
PDF
attack arXiv Apr 2, 2026 · 4d ago

Low-Effort Jailbreak Attacks Against Text-to-Image Safety Filters

Ahmed B Mustafa, Zihan Ye, Yang Lu et al. · University of Nottingham · Xi’an Jiaotong-Liverpool University +1 more

Low-effort prompt-based jailbreaks bypass text-to-image safety filters using linguistic reframing, achieving 74% attack success

Prompt Injection multimodalgenerative
PDF
survey arXiv Apr 1, 2026 · 5d ago

Safety, Security, and Cognitive Risks in World Models

Manoj Parmar · SovereignAI Security Labs

Unified threat model for world model AI systems covering adversarial attacks, data poisoning, alignment risks, and cognitive security

Input Manipulation Attack Data Poisoning Attack Model Poisoning Prompt Injection Excessive Agency reinforcement-learningmultimodalvisionnlp
PDF
benchmark arXiv Apr 1, 2026 · 5d ago

ClawSafety: "Safe" LLMs, Unsafe Agents

Bowen Wei, Yunbei Zhang, Jinhao Pan et al. · George Mason University · Tulane University +2 more

Benchmark of 120 prompt injection attacks on personal AI agents across skill files, emails, and web content

Prompt Injection Excessive Agency nlpmultimodal
PDF
defense arXiv Apr 1, 2026 · 5d ago

SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits

Zikai Zhang, Rui Hu, Olivera Kotevska et al. · University of Nevada · Oak Ridge National Laboratory

Detects LLM jailbreak attacks using logit distributions over numerical tokens, achieving 22.66% ASR reduction with minimal overhead

Prompt Injection nlp
PDF
benchmark arXiv Apr 1, 2026 · 5d ago

Automated Framework to Evaluate and Harden LLM System Instructions against Encoding Attacks

Anubhab Sahu, Diptisha Samanta, Reza Soosahabi · Keysight Technologies

Automated framework evaluating LLM system instruction leakage via encoding attacks, achieving 70%+ success rates with structured formats

Sensitive Information Disclosure Prompt Injection nlp
PDF Code
defense arXiv Apr 1, 2026 · 5d ago

AgentWatcher: A Rule-based Prompt Injection Monitor

Yanting Wang, Wei Zou, Runpeng Geng et al. · The Pennsylvania State University

Rule-based prompt injection detector using causal attribution to identify malicious context segments in long-context LLM agents

Prompt Injection Excessive Agency nlp
PDF Code
benchmark arXiv Apr 1, 2026 · 5d ago

Cooking Up Risks: Benchmarking and Reducing Food Safety Risks in Large Language Models

Weidi Luo, Xiaofei Wen, Tenghao Huang et al. · University of Georgia · University of California +3 more

Benchmark and guardrail for detecting jailbreak attacks that bypass LLM safety alignment in food safety domain

Prompt Injection nlp
PDF Code
attack arXiv Apr 1, 2026 · 5d ago

When Safe Models Merge into Danger: Exploiting Latent Vulnerabilities in LLM Fusion

Jiaqing Li, Zhibo Zhang, Shide Zhou et al. · Huazhong University of Science and Technology · Hubei University

Embeds latent trojans in individually safe LLMs that activate during model merging, bypassing safety alignment

Model Poisoning AI Supply Chain Attacks Prompt Injection nlp
PDF
defense arXiv Mar 31, 2026 · 6d ago

Robust Multimodal Safety via Conditional Decoding

Anurag Kumar, Raghuveer Peri, Jon Burnsky et al. · The Ohio State University · AWS

Conditional decoding defense using internal safety classification that blocks multimodal jailbreaks across text, image, and audio inputs

Input Manipulation Attack Prompt Injection multimodalnlpvisionaudio
PDF
attack arXiv Mar 31, 2026 · 6d ago

Adversarial Prompt Injection Attack on Multimodal Large Language Models

Meiwen Ding, Song Xia, Chenqi Kong et al. · Nanyang Technological University

Embeds imperceptible adversarial prompts into images via visual perturbations to jailbreak closed-source multimodal LLMs

Input Manipulation Attack Prompt Injection multimodalvisionnlp
PDF
survey arXiv Mar 31, 2026 · 6d ago

The Persistent Vulnerability of Aligned AI Systems

Aengus Lynch · University College London

Comprehensive AI safety thesis spanning mechanistic interpretability, sleeper agent defenses, jailbreaking frontier models, and autonomous agent misalignment

Input Manipulation Attack Prompt Injection Excessive Agency nlpvisionaudiomultimodal
PDF
survey arXiv Mar 31, 2026 · 6d ago

Security in LLM-as-a-Judge: A Comprehensive SoK

Aiman Almasoud, Antony Anju, Marco Arazzi et al. · arXiv · University of Pavia +1 more

First comprehensive survey organizing 45 studies on security risks of LLM-as-a-Judge systems including adversarial manipulation and evaluation vulnerabilities

Prompt Injection nlp
PDF
defense arXiv Mar 30, 2026 · 7d ago

CivicShield: A Cross-Domain Defense-in-Depth Framework for Securing Government-Facing AI Chatbots Against Multi-Turn Adversarial Attacks

KrishnaSaiReddy Patil

Seven-layer defense framework for government AI chatbots achieving 73% detection against jailbreaks with graduated human escalation

Prompt Injection nlp
PDF
benchmark arXiv Mar 30, 2026 · 7d ago

Evaluating Privilege Usage of Agents on Real-World Tools

Quan Zhang, Lianhang Fu, Lvsi Lian et al. · East China Normal University · Xinjiang University +1 more

Benchmark evaluating LLM agents' privilege control under prompt injection attacks using real-world tools, finding 84.80% attack success

Prompt Injection Insecure Plugin Design Excessive Agency nlp
PDF
benchmark arXiv Mar 30, 2026 · 7d ago

Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers

Haochuan Kevin Wang · Massachusetts Institute of Technology

Stage-level prompt injection benchmark tracking cryptographic canaries across four kill-chain stages in multi-agent systems

Prompt Injection nlp
PDF
attack arXiv Mar 30, 2026 · 7d ago

Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning

Bilgehan Sel, Xuanli He, Alwin Peng et al. · Anthropic · Virginia Tech +1 more

Adversarial fine-tuning attack that bypasses Constitutional Classifiers via curriculum learning, achieving 99% evasion with minimal capability loss

Prompt Injection Training Data Poisoning nlp
PDF
survey Transactions on Machine Learni... Mar 30, 2026 · 7d ago

Adversarial Attacks on Multimodal Large Language Models: A Comprehensive Survey

Bhavuk Jain, Sercan Ö. Arık, Hardeo K. Thakur · Google · Bennett University

Surveys adversarial attacks on multimodal LLMs, organizing threats by attacker objectives and linking attacks to architectural vulnerabilities

Input Manipulation Attack Prompt Injection multimodalnlpvisionaudio
PDF
attack arXiv Mar 30, 2026 · 7d ago

XSPA: Crafting Imperceptible X-Shaped Sparse Adversarial Perturbations for Transferable Attacks on VLMs

Chengyin Hu, Jiaju Han, Xuemeng Sun et al.

Sparse adversarial attack on VLMs using X-shaped pixel perturbations that transfer across classification, captioning, and VQA tasks

Input Manipulation Attack Prompt Injection visionnlpmultimodal
PDF
attack arXiv Mar 29, 2026 · 8d ago

When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models

Chengyin Hu, Xuemeng Sun, Jiajun Han et al.

Generates adversarial wrinkle-like surface deformations that fool VLMs on classification, captioning, and VQA through physically plausible non-rigid perturbations

Input Manipulation Attack Prompt Injection visionnlpmultimodal
PDF
Loading more papers…