Latest papers

2 papers
defense arXiv Feb 5, 2026 · 8w ago

Spider-Sense: Intrinsic Risk Sensing for Efficient Agent Defense with Hierarchical Adaptive Screening

Zhenxiong Yu, Zhi Yang, Zhiheng Jin et al. · SUFE · NUS +5 more

Event-driven LLM agent defense that selectively triggers hierarchical screening against prompt injection and multi-stage agent attacks

Prompt Injection Excessive Agency nlp
PDF Code
benchmark arXiv Sep 22, 2025 · Sep 2025

D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models

Satyapriya Krishna, Andy Zou, Rahul Gupta et al. · Amazon Nova Responsible AI · Center for AI Safety +2 more

Benchmark dataset for detecting LLMs that hide malicious chain-of-thought behind benign outputs via adversarial system prompt injections

Prompt Injection nlp
2 citations PDF