defense 2026

Risk Awareness Injection: Calibrating Vision-Language Models for Safety without Compromising Utility

Mengxuan Wang 1,2, Yuxin Chen 2,3, Gang Xu 2, Tao He 4, Hongjie Jiang 1, Ming Li 2

0 citations · 33 references · arXiv (Cornell University)

Published on arXiv

2602.03402

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

RAI substantially reduces multimodal jailbreak attack success rates on VLMs without compromising cross-modal reasoning utility, outperforming aggressive token-pruning defenses on the safety-utility tradeoff.

Risk Awareness Injection (RAI)

Novel technique introduced


Vision language models (VLMs) extend the reasoning capabilities of large language models (LLMs) to cross-modal settings, yet remain highly vulnerable to multimodal jailbreak attacks. Existing defenses predominantly rely on safety fine-tuning or aggressive token manipulations, incurring substantial training costs or significantly degrading utility. Recent research shows that LLMs inherently recognize unsafe content in text, and the incorporation of visual inputs in VLMs frequently dilutes risk-related signals. Motivated by this, we propose Risk Awareness Injection (RAI), a lightweight and training-free framework for safety calibration that restores LLM-like risk recognition by amplifying unsafe signals in VLMs. Specifically, RAI constructs an Unsafe Prototype Subspace from language embeddings and performs targeted modulation on selected high-risk visual tokens, explicitly activating safety-critical signals within the cross-modal feature space. This modulation restores the model's LLM-like ability to detect unsafe content from visual inputs, while preserving the semantic integrity of original tokens for cross-modal reasoning. Extensive experiments across multiple jailbreak and utility benchmarks demonstrate that RAI substantially reduces attack success rate without compromising task performance.


Key Contributions

  • Risk Awareness Injection (RAI): a training-free framework that constructs an Unsafe Prototype Subspace from LLM language embeddings to identify safety-critical directions in feature space
  • Targeted modulation of selected high-risk visual tokens to amplify unsafe signals in the cross-modal feature space, restoring LLM-like risk recognition without altering benign tokens
  • Empirical demonstration that RAI substantially reduces attack success rate across multiple multimodal jailbreak benchmarks while preserving VLM task utility
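The mechanism the contributions describe can be sketched minimally: build a low-rank "unsafe" subspace from language embeddings of unsafe content, score each visual token by the norm of its projection onto that subspace, and amplify the in-subspace component of only the highest-scoring tokens. Everything below is an illustrative assumption, not the paper's implementation: the dimensions, the random stand-in embeddings, the function names (`risk_score`, `inject_risk_awareness`), and the parameters (`top_frac`, `alpha`).

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 4  # embedding dim and subspace rank (illustrative values)

# Stand-ins for LLM language embeddings of unsafe content; in the paper
# these would come from the model's own embedding space.
unsafe_embs = rng.normal(size=(32, d))

# Unsafe Prototype Subspace: top-k left singular vectors of the unsafe set.
U, _, _ = np.linalg.svd(unsafe_embs.T, full_matrices=False)
basis = U[:, :k]  # (d, k) orthonormal basis of the unsafe subspace

def risk_score(tokens, basis):
    """Norm of each token's projection onto the unsafe subspace."""
    return np.linalg.norm(tokens @ basis, axis=-1)

def inject_risk_awareness(tokens, basis, top_frac=0.1, alpha=0.5):
    """Amplify the unsafe-subspace component of the highest-risk visual
    tokens; all other tokens are left untouched, preserving semantics."""
    scores = risk_score(tokens, basis)
    n_sel = max(1, int(top_frac * len(tokens)))
    idx = np.argsort(scores)[-n_sel:]        # indices of high-risk tokens
    proj = (tokens[idx] @ basis) @ basis.T   # component inside the subspace
    out = tokens.copy()
    out[idx] = tokens[idx] + alpha * proj    # amplify the unsafe signal
    return out, idx

visual_tokens = rng.normal(size=(100, d))
modulated, selected = inject_risk_awareness(visual_tokens, basis)
```

Because the modulation only adds back a scaled copy of each selected token's in-subspace component, the token's out-of-subspace (semantic) part is unchanged, which is one plausible reading of how utility is preserved.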

🛡️ Threat Analysis

Input Manipulation Attack

The defense targets adversarial visual inputs to VLMs — specifically high-risk visual tokens that carry jailbreak signals — by modulating them in the cross-modal feature space at inference time. The attacks being defended against (e.g., FigStep, JailbreakV-28K) include adversarially crafted visual inputs designed to bypass safety.


Details

Domains
vision · nlp · multimodal
Model Types
vlm · llm · transformer
Threat Tags
inference_time · digital · black_box
Datasets
JailbreakV-28K · FigStep
Applications
vision-language models · multimodal safety alignment · jailbreak defense