Detecting Misbehaviors of Large Vision-Language Models by Evidential Uncertainty Quantification
Tao Huang 1,2,3, Rui Wang 2,3, Xiaofei Liu 1,2,3, Yi Qin 1,2,3, Li Duan 4, Liping Jing 1,2,3
1 State Key Laboratory of Advanced Rail Autonomous Operation
2 Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence
4 Beijing Key Laboratory of Security and Privacy in Intelligent Transportation
Published on arXiv: 2602.05535
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
EUQ achieves up to 10.5% relative AUROC improvement over strong baselines across four LVLM misbehavior categories including jailbreaks and adversarial vulnerabilities.
EUQ (Evidential Uncertainty Quantification)
Novel technique introduced
Large vision-language models (LVLMs) have shown substantial advances in multimodal understanding and generation. However, when presented with incompetent or adversarial inputs, they frequently produce unreliable or even harmful content, such as factual hallucinations or dangerous instructions. This misalignment with human expectations, referred to as *misbehaviors* of LVLMs, raises serious concerns for deployment in critical applications. These misbehaviors are found to stem from epistemic uncertainty, specifically either conflicting internal knowledge or the absence of supporting information. However, existing uncertainty quantification methods, which typically capture only overall epistemic uncertainty, have shown limited effectiveness in identifying such issues. To address this gap, we propose Evidential Uncertainty Quantification (EUQ), a fine-grained method that captures both information conflict and ignorance for effective detection of LVLM misbehaviors. In particular, we interpret features from the model output head as either supporting (positive) or opposing (negative) evidence. Leveraging Evidence Theory, we model and aggregate this evidence to quantify internal conflict and knowledge gaps within a single forward pass. We extensively evaluate our method across four categories of misbehavior — hallucinations, jailbreaks, adversarial vulnerabilities, and out-of-distribution (OOD) failures — using state-of-the-art LVLMs. EUQ consistently outperforms strong baselines, and the results show that hallucinations correspond to high internal conflict while OOD failures correspond to high ignorance. Furthermore, analysis of layer-wise evidential uncertainty dynamics offers a new perspective for interpreting the evolution of internal representations. The source code is available at https://github.com/HT86159/EUQ.
Key Contributions
- EUQ: a training-free framework that decomposes epistemic uncertainty into conflict (internal contradiction) and ignorance (missing knowledge) using Dempster-Shafer Theory on output head features
- Unified detection across four LVLM misbehavior categories — hallucinations, jailbreaks, adversarial vulnerabilities, and OOD failures — within a single forward pass
- Empirical finding that hallucinations correlate with high internal conflict and OOD failures with high ignorance, enabling interpretable diagnostics via layer-wise uncertainty dynamics
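The decomposition described above can be illustrated with a minimal Dempster-Shafer sketch over a two-element frame {support, oppose}: each evidence source assigns mass to "support", "oppose", or the whole frame (ignorance), and Dempster's rule exposes the conflict mass during combination. The evidence values and the extraction step are illustrative assumptions, not the authors' implementation.

```python
# Minimal Dempster-Shafer sketch on the frame {S: support, O: oppose},
# with "T" denoting mass on the whole frame (ignorance / theta).
# The input masses are hypothetical stand-ins for evidence derived
# from output-head features, as the paper describes at a high level.

def combine(m1, m2):
    """Combine two basic mass assignments with Dempster's rule.

    Returns the combined masses and the conflict mass k
    (probability that the two sources directly contradict).
    """
    # Conflict: one source supports while the other opposes.
    k = m1["S"] * m2["O"] + m1["O"] * m2["S"]
    norm = 1.0 - k  # Dempster normalization factor
    m = {
        # {S} survives intersection with {S} or with the whole frame.
        "S": (m1["S"] * m2["S"] + m1["S"] * m2["T"] + m1["T"] * m2["S"]) / norm,
        "O": (m1["O"] * m2["O"] + m1["O"] * m2["T"] + m1["T"] * m2["O"]) / norm,
        # Residual mass on the whole frame = remaining ignorance.
        "T": (m1["T"] * m2["T"]) / norm,
    }
    return m, k

if __name__ == "__main__":
    # Two hypothetical evidence sources: one mostly supporting,
    # one mostly opposing the same generated token.
    e1 = {"S": 0.6, "O": 0.1, "T": 0.3}
    e2 = {"S": 0.2, "O": 0.5, "T": 0.3}
    m, conflict = combine(e1, e2)
    print(f"conflict={conflict:.2f}, ignorance={m['T']:.2f}")
```

In this toy setup, a high conflict mass would flag hallucination-like disagreement between evidence sources, while high residual ignorance would flag OOD-like absence of evidence — mirroring the paper's empirical finding, though the actual scoring operates on LVLM output-head features rather than hand-set masses.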
🛡️ Threat Analysis
The paper explicitly evaluates detection of adversarial vulnerabilities in LVLMs — adversarial inputs that cause incorrect or harmful outputs — which is a core ML01 threat. EUQ is evaluated as a defense against adversarial input manipulation of LVLMs.