defense 2026

Risk Awareness Injection: Calibrating Vision-Language Models for Safety without Compromising Utility

Mengxuan Wang 1,2, Yuxin Chen 2,3, Gang Xu 2, Tao He 4, Hongjie Jiang 1, Ming Li 2

0 citations · 33 references · arXiv (Cornell University)

Published on arXiv

2602.03402

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

RAI substantially reduces multimodal jailbreak attack success rates on VLMs without compromising cross-modal reasoning utility, outperforming aggressive token-pruning defenses on the safety-utility tradeoff.

Risk Awareness Injection (RAI)

Novel technique introduced


Vision language models (VLMs) extend the reasoning capabilities of large language models (LLMs) to cross-modal settings, yet remain highly vulnerable to multimodal jailbreak attacks. Existing defenses predominantly rely on safety fine-tuning or aggressive token manipulations, incurring substantial training costs or significantly degrading utility. Recent research shows that LLMs inherently recognize unsafe content in text, and the incorporation of visual inputs in VLMs frequently dilutes risk-related signals. Motivated by this, we propose Risk Awareness Injection (RAI), a lightweight and training-free framework for safety calibration that restores LLM-like risk recognition by amplifying unsafe signals in VLMs. Specifically, RAI constructs an Unsafe Prototype Subspace from language embeddings and performs targeted modulation on selected high-risk visual tokens, explicitly activating safety-critical signals within the cross-modal feature space. This modulation restores the model's LLM-like ability to detect unsafe content from visual inputs, while preserving the semantic integrity of original tokens for cross-modal reasoning. Extensive experiments across multiple jailbreak and utility benchmarks demonstrate that RAI substantially reduces attack success rate without compromising task performance.


Key Contributions

  • Risk Awareness Injection (RAI): a training-free framework that constructs an Unsafe Prototype Subspace from LLM language embeddings to identify safety-critical directions in feature space
  • Targeted modulation of selected high-risk visual tokens to amplify unsafe signals in the cross-modal feature space, restoring LLM-like risk recognition without altering benign tokens
  • Empirical demonstration that RAI substantially reduces attack success rate across multiple multimodal jailbreak benchmarks while preserving VLM task utility
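The mechanism the contributions describe can be sketched minimally: build a low-rank "unsafe" subspace from language embeddings of unsafe content, score each visual token by the norm of its projection onto that subspace, and amplify the in-subspace component of only the highest-scoring tokens. Everything below is an illustrative assumption, not the paper's implementation: the dimensions, the random stand-in embeddings, the function names (`risk_score`, `inject_risk_awareness`), and the parameters (`top_frac`, `alpha`).

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 4  # embedding dim and subspace rank (illustrative values)

# Stand-ins for LLM language embeddings of unsafe content; in the paper
# these would come from the model's own embedding space.
unsafe_embs = rng.normal(size=(32, d))

# Unsafe Prototype Subspace: top-k left singular vectors of the unsafe set.
U, _, _ = np.linalg.svd(unsafe_embs.T, full_matrices=False)
basis = U[:, :k]  # (d, k) orthonormal basis of the unsafe subspace

def risk_score(tokens, basis):
    """Norm of each token's projection onto the unsafe subspace."""
    return np.linalg.norm(tokens @ basis, axis=-1)

def inject_risk_awareness(tokens, basis, top_frac=0.1, alpha=0.5):
    """Amplify the unsafe-subspace component of the highest-risk visual
    tokens; all other tokens are left untouched, preserving semantics."""
    scores = risk_score(tokens, basis)
    n_sel = max(1, int(top_frac * len(tokens)))
    idx = np.argsort(scores)[-n_sel:]        # indices of high-risk tokens
    proj = (tokens[idx] @ basis) @ basis.T   # component inside the subspace
    out = tokens.copy()
    out[idx] = tokens[idx] + alpha * proj    # amplify the unsafe signal
    return out, idx

visual_tokens = rng.normal(size=(100, d))
modulated, selected = inject_risk_awareness(visual_tokens, basis)
```

Because the modulation only adds back a scaled copy of each selected token's in-subspace component, the token's out-of-subspace (semantic) part is unchanged, which is one plausible reading of how utility is preserved.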

🛡️ Threat Analysis

Input Manipulation Attack

The defense targets adversarial visual inputs to VLMs — specifically high-risk visual tokens that carry jailbreak signals — by modulating them in the cross-modal feature space at inference time. The attacks being defended against (e.g., FigStep, JailbreakV-28K) include adversarially crafted visual inputs designed to bypass safety.


Details

Domains
vision · nlp · multimodal
Model Types
vlm · llm · transformer
Threat Tags
inference_time · digital · black_box
Datasets
JailbreakV-28K · FigStep
Applications
vision-language models · multimodal safety alignment · jailbreak defense