Kevin Zhu

defense arXiv Feb 21, 2026

MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs

Chun Yan Ryan Kan, Tommy Tran, Vedant Yadav et al.

Diffusion-based defense projects LLM hidden states onto benign manifolds at inference time to neutralize jailbreak attacks

Input Manipulation Attack · Prompt Injection · nlp

defense arXiv Dec 12, 2025

Factor(U,T): Controlling Untrusted AI by Monitoring their Plans

Edward Lue Chee Lip, Anthony Channg, Diana Kim et al. · Algoverse AI Research · Colorado State University +1 more

Evaluates safety protocols for multi-agent LLM systems where an untrusted decomposer can inject malicious subtask instructions undetectable by monitors

Excessive Agency · Prompt Injection · nlp

defense arXiv Jan 18, 2026

Zero-Shot Embedding Drift Detection: A Lightweight Defense Against Prompt Injections in LLMs

Anirudh Sekar, Mrinal Agarwal, Rachel Sharma et al. · Algoverse AI Research · University of California

Defends LLM pipelines against prompt injection by detecting semantic embedding drift via cosine similarity, achieving 93%+ accuracy zero-shot

Prompt Injection · nlp
