
MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs

Chun Yan Ryan Kan , Tommy Tran , Vedant Yadav , Ava Cai , Kevin Zhu , Ruizhe Li , Maheep Chaudhary

0 citations · 25 references · arXiv (Cornell University)


Published on arXiv: 2602.18782

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

MANATEE reduces Attack Success Rate by up to 100% on certain jailbreak datasets across three LLM families without harmful training data or model fine-tuning

MANATEE (Manifold-Aligned Neutralization via Attractor-based Trajectory Editing and Enhancement)

Novel technique introduced


Defending LLMs against adversarial jailbreak attacks remains an open challenge. Existing defenses rely on binary classifiers that fail when adversarial inputs fall outside the learned decision boundary, and repeated fine-tuning is computationally expensive and can degrade model capabilities. We propose MANATEE, an inference-time defense that uses density estimation over a benign representation manifold. MANATEE learns the score function of benign hidden states and uses diffusion to project anomalous representations toward safe regions, requiring no harmful training data and no architectural modifications. Experiments across Mistral-7B-Instruct, Llama-3.1-8B-Instruct, and Gemma-2-9B-it demonstrate that MANATEE reduces Attack Success Rate by up to 100% on certain datasets while preserving model utility on benign inputs.
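The abstract's two-step recipe — learn the score (gradient of log-density) of benign hidden states, then run diffusion-style updates that pull anomalous states toward high-density regions — can be illustrated with a toy closed-form density. This is a minimal sketch under illustrative assumptions, not the paper's implementation: MANATEE learns a neural score model over real LLM hidden states, whereas here a Gaussian fit stands in so the score is available analytically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "benign hidden states": a Gaussian cluster standing in for an LLM's
# hidden representations on safe prompts (illustrative, not real activations).
benign = rng.normal(loc=2.0, scale=0.5, size=(1000, 8))

# Fit a Gaussian density to the benign manifold. For a Gaussian the score
# function has a closed form, so no score network needs to be trained here.
mu = benign.mean(axis=0)
cov = np.cov(benign, rowvar=False) + 1e-6 * np.eye(8)
cov_inv = np.linalg.inv(cov)

def score(x):
    """Score of the fitted Gaussian: grad_x log p(x) = -Sigma^-1 (x - mu)."""
    return -(x - mu) @ cov_inv

# An "anomalous" hidden state far from the benign cluster.
x = np.full(8, -3.0)

# Noise-free Langevin / gradient-ascent-on-log-density steps: each update
# moves the state up the benign log-density, projecting it toward the cluster.
for _ in range(200):
    x = x + 0.05 * score(x)

print(np.linalg.norm(x - mu))  # small: x has been pulled near the benign mean
```

The key design point the sketch captures is that only *benign* data is needed: the score model never sees harmful examples, it simply makes off-manifold states flow back toward regions where benign states live.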


Key Contributions

  • Reframes LLM safety as density estimation over a benign representation manifold, eliminating the need for harmful training data or architectural modifications
  • MANATEE: a plug-and-play diffusion module that operates in hidden-state space to iteratively project anomalous representations toward safe regions at inference time
  • Achieves up to 100% ASR reduction on certain datasets across Mistral-7B-Instruct, Llama-3.1-8B-Instruct, and Gemma-2-9B-it while preserving benign input utility

🛡️ Threat Analysis

Input Manipulation Attack

Defends against gradient-based adversarial input attacks (GCG/adversarial suffix optimization) at inference time by projecting anomalous hidden states back toward benign regions via score-based diffusion.
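The gated behavior described above — intervene only when a hidden state looks off-manifold, and pass benign states through untouched so utility is preserved — can be sketched as a simple inference-time hook. All names, thresholds, and the Mahalanobis-distance anomaly test below are illustrative assumptions, not MANATEE's actual detector or schedule.

```python
import numpy as np

rng = np.random.default_rng(1)

# Benign hidden-state statistics (stand-ins for a fitted score model).
benign = rng.normal(0.0, 1.0, size=(2000, 16))
mu = benign.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(benign, rowvar=False) + 1e-6 * np.eye(16))

def mahalanobis(h):
    """Distance of a hidden state from the benign cluster (anomaly proxy)."""
    d = h - mu
    return float(np.sqrt(d @ cov_inv @ d))

def defend(h, threshold=6.0, steps=200, step_size=0.05):
    """Gated projection: leave likely-benign states unchanged (utility),
    and only diffuse off-manifold states toward the benign region."""
    if mahalanobis(h) <= threshold:
        return h  # benign input: pass through unchanged
    for _ in range(steps):
        h = h + step_size * (-(h - mu) @ cov_inv)  # follow grad log p
    return h

benign_h = rng.normal(0.0, 1.0, size=16)
adv_h = np.full(16, 5.0)  # e.g. a state perturbed by a GCG-style suffix

print(np.array_equal(defend(benign_h), benign_h))       # benign state untouched
print(mahalanobis(defend(adv_h)) < mahalanobis(adv_h))  # adversarial state pulled back
```

Gating on an anomaly score is one way to reconcile the two goals in the key finding: aggressive projection for attack-like states, zero interference on ordinary inputs.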


Details

Domains
nlp
Model Types
llm · transformer · diffusion
Threat Tags
white_box · black_box · inference_time
Datasets
JailbreakBench
Applications
llm safety · chatbots · deployed llm systems