attack arXiv Sep 16, 2025
Vincent Siu, Nathan W. Henry, Nicholas Crispino et al. · University of California
Isolates concept-specific refusal vectors to surgically bypass LLM safety on targeted topics like WMDs using ~12 examples
Prompt Injection nlp
While activation steering in large language models (LLMs) is a growing area of research, methods can often incur broader effects than desired. This motivates isolation of purer concept vectors to enable targeted interventions and understand LLM behavior at a more granular level. We present RepIt, a simple and data-efficient framework for isolating concept-specific representations. Across five frontier LLMs, RepIt enables precise interventions: it selectively suppresses refusal on targeted concepts while preserving refusal elsewhere, producing models that answer WMD-related questions while still scoring as safe on standard benchmarks. We further show that the corrective signal localizes to just 100-200 neurons and that robust target representations can be extracted from as few as a dozen examples on a single A6000. This efficiency raises a dual concern: manipulations can be performed with modest compute and data to extend to underrepresented data-scarce topics while evading existing benchmarks. By disentangling refusal vectors with RepIt, this work demonstrates that targeted interventions can counteract overgeneralization, laying the foundation for more granular control of model behavior.
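RepIt's exact procedure is not reproduced here, but the core mechanics it builds on, estimating a concept direction from a handful of paired examples and then ablating that direction, can be illustrated with a minimal difference-of-means sketch on synthetic activations. All names, dimensions, and sample counts below are illustrative, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy hidden size

# Synthetic residual-stream activations: "harmful" prompts carry an
# extra component along a hidden refusal direction.
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
harmless = rng.normal(size=(12, d))              # ~a dozen examples each,
harmful = rng.normal(size=(12, d)) + 4.0 * true_dir  # as in the abstract

# Difference-of-means estimate of the refusal direction.
v = harmful.mean(axis=0) - harmless.mean(axis=0)
v /= np.linalg.norm(v)

# Directional ablation: remove the component of an activation along v.
def ablate(h, v):
    return h - (h @ v) * v

h = harmful[0]
print(h @ v)             # projection before ablation
print(ablate(h, v) @ v)  # projection after ablation (exactly zero)
```

The ablated activation is orthogonal to `v` by construction; the interesting empirical question, which RepIt addresses, is making `v` specific enough that ablation suppresses refusal only on the targeted concept.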
llm transformer
benchmark arXiv Feb 13, 2026
Xu Li, Simon Yu, Minzhou Pan et al. · Northeastern University · Virtue AI · University of California +1 more
Benchmarks multi-turn jailbreaks in tool-using LLM agents and proposes ToolShield, a self-exploration defense reducing ASR by 30%
Prompt Injection Insecure Plugin Design nlp
LLM-based agents are becoming increasingly capable, yet their safety lags behind. This creates a gap between what agents can do and should do. This gap widens as agents engage in multi-turn interactions and employ diverse tools, introducing new risks overlooked by existing benchmarks. To systematically scale safety testing into multi-turn, tool-realistic settings, we propose a principled taxonomy that transforms single-turn harmful tasks into multi-turn attack sequences. Using this taxonomy, we construct MT-AgentRisk (Multi-Turn Agent Risk Benchmark), the first benchmark to evaluate multi-turn tool-using agent safety. Our experiments reveal substantial safety degradation: the Attack Success Rate (ASR) increases by 16% on average across open and closed models in multi-turn settings. To close this gap, we propose ToolShield, a training-free, tool-agnostic, self-exploration defense: when encountering a new tool, the agent autonomously generates test cases, executes them to observe downstream effects, and distills safety experiences for deployment. Experiments show that ToolShield effectively reduces ASR by 30% on average in multi-turn interactions. Our code is available at https://github.com/CHATS-lab/ToolShield.
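The self-exploration loop described above (on encountering a new tool, generate test cases, execute them, and distill the observed effects into reusable safety experience) can be sketched as follows. The toy tool, harm predicate, and note format are invented for illustration; ToolShield itself uses an LLM for test-case generation and distillation:

```python
def probe_tool(tool, test_inputs, is_harmful_effect):
    """Run generated test cases against a new tool in a sandbox and
    distill a safety note from the observed downstream effects."""
    risky = []
    for x in test_inputs:
        effect = tool(x)                  # sandboxed execution
        if is_harmful_effect(effect):     # observe downstream effect
            risky.append(x)
    # "Distilled" safety experience the agent consults at deployment time.
    return {
        "tool": tool.__name__,
        "risky_inputs": risky,
        "policy": "guard" if risky else "allow",
    }

# Toy tool: deletes a path; anything outside /tmp counts as harmful here.
def delete_file(path):
    return {"deleted": path}

note = probe_tool(
    delete_file,
    test_inputs=["/tmp/scratch.txt", "/etc/passwd"],
    is_harmful_effect=lambda e: not e["deleted"].startswith("/tmp/"),
)
print(note)
```

The point of the design is that the probing cost is paid once per tool, before deployment, so the distilled note can gate risky calls at inference time without any retraining.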
llm
survey arXiv Mar 11, 2026
Juhee Kim, Xiaoyuan Liu, Zhun Wang et al. · University of California · Seoul National University · University of Illinois Urbana-Champaign
Surveys attacks and defenses across agentic LLM systems, covering prompt injection, insecure tool use, and excessive agency risks
Prompt Injection Insecure Plugin Design Excessive Agency nlp multimodal
AI agents that combine large language models with non-AI system components are rapidly emerging in real-world applications, offering unprecedented automation and flexibility. However, this flexibility introduces complex security challenges fundamentally different from those in traditional software systems. This paper presents the first systematic and comprehensive survey of AI agent security, including an analysis of the design space, attack landscape, and defense mechanisms for secure AI agent systems. We further conduct case studies to point out existing gaps in securing agentic AI systems and identify open challenges in this emerging domain. Our work also introduces the first systematic framework for understanding the security risks and defense strategies of AI agents, serving as a foundation for building secure agentic systems and advancing research in this critical area.
llm vlm multimodal
defense arXiv Feb 26, 2026
Yanpei Guo, Wenjie Qu, Linyu Wu et al. · National University of Singapore · Nanyang Technological University · University of California
Auditing framework using verifiable computation to detect LLM provider fraud — model substitution, quantization abuse, token overbilling — with under 1% overhead
Output Integrity Attack nlp
Commercial large language models are typically deployed as black-box API services, requiring users to trust providers to execute inference correctly and report token usage honestly. We present IMMACULATE, a practical auditing framework that detects economically motivated deviations, such as model substitution, quantization abuse, and token overbilling, without trusted hardware or access to model internals. IMMACULATE selectively audits a small fraction of requests using verifiable computation, achieving strong detection guarantees while amortizing cryptographic overhead. Experiments on dense and MoE models show that IMMACULATE reliably distinguishes benign and malicious executions with under 1% throughput overhead. Our code is published at https://github.com/guo-yanpei/Immaculate.
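The amortization argument behind selective auditing can be quantified with a back-of-envelope calculation (this is generic probability, not IMMACULATE's actual protocol): if each request is independently audited with probability p, a provider that deviates on k requests evades every audit with probability (1 - p)^k, so even a small audit fraction catches sustained cheating with high probability.

```python
def detection_prob(p, k):
    """Probability that at least one of k deviating requests is audited,
    when each request is audited independently with probability p."""
    return 1.0 - (1.0 - p) ** k

# Even a 1% audit rate makes sustained cheating hard to hide.
for k in (10, 100, 1000):
    print(k, detection_prob(0.01, k))
```

This is why amortized cryptographic overhead can stay low: the per-request audit probability (and hence the average verification cost) can be kept small while the cumulative detection guarantee against a persistent cheater remains strong.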
llm transformer