Detecting Misbehaviors of Large Vision-Language Models by Evidential Uncertainty Quantification
Tao Huang 1,2,3, Rui Wang 2,3, Xiaofei Liu 1,2,3, Yi Qin 1,2,3, Li Duan 4, Liping Jing 1,2,3
1 State Key Laboratory of Advanced Rail Autonomous Operation
2 Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence
4 Beijing Key Laboratory of Security and Privacy in Intelligent Transportation
Published on arXiv: 2602.05535
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
EUQ achieves up to 10.5% relative AUROC improvement over strong baselines across four LVLM misbehavior categories including jailbreaks and adversarial vulnerabilities.
EUQ (Evidential Uncertainty Quantification)
Novel technique introduced
Large vision-language models (LVLMs) have shown substantial advances in multimodal understanding and generation. However, when presented with incompetent or adversarial inputs, they frequently produce unreliable or even harmful content, such as factual hallucinations or dangerous instructions. This misalignment with human expectations, referred to as *misbehaviors* of LVLMs, raises serious concerns for deployment in critical applications. These misbehaviors are found to stem from epistemic uncertainty, specifically either conflicting internal knowledge or the absence of supporting information. However, existing uncertainty quantification methods, which typically capture only overall epistemic uncertainty, have shown limited effectiveness in identifying such issues. To address this gap, we propose Evidential Uncertainty Quantification (EUQ), a fine-grained method that captures both information conflict and ignorance for effective detection of LVLM misbehaviors. In particular, we interpret features from the model output head as either supporting (positive) or opposing (negative) evidence. Leveraging Evidence Theory, we model and aggregate this evidence to quantify internal conflict and knowledge gaps within a single forward pass. We extensively evaluate our method across four categories of misbehavior — hallucinations, jailbreaks, adversarial vulnerabilities, and out-of-distribution (OOD) failures — using state-of-the-art LVLMs. EUQ consistently outperforms strong baselines, and the results show that hallucinations correspond to high internal conflict while OOD failures correspond to high ignorance. Furthermore, analysis of layer-wise evidential uncertainty dynamics offers a new perspective for interpreting the evolution of internal representations. The source code is available at https://github.com/HT86159/EUQ.
Key Contributions
- EUQ: a training-free framework that decomposes epistemic uncertainty into conflict (internal contradiction) and ignorance (missing knowledge) using Dempster-Shafer Theory on output head features
- Unified detection across four LVLM misbehavior categories — hallucinations, jailbreaks, adversarial vulnerabilities, and OOD failures — within a single forward pass
- Empirical finding that hallucinations correlate with high internal conflict and OOD failures with high ignorance, enabling interpretable diagnostics via layer-wise uncertainty dynamics
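The decomposition described above can be illustrated with a minimal Dempster-Shafer sketch over a two-element frame {support, oppose}: each evidence source assigns mass to "support", "oppose", or the whole frame (ignorance), and Dempster's rule exposes the conflict mass during combination. The evidence values and the extraction step are illustrative assumptions, not the authors' implementation.

```python
# Minimal Dempster-Shafer sketch on the frame {S: support, O: oppose},
# with "T" denoting mass on the whole frame (ignorance / theta).
# The input masses are hypothetical stand-ins for evidence derived
# from output-head features, as the paper describes at a high level.

def combine(m1, m2):
    """Combine two basic mass assignments with Dempster's rule.

    Returns the combined masses and the conflict mass k
    (probability that the two sources directly contradict).
    """
    # Conflict: one source supports while the other opposes.
    k = m1["S"] * m2["O"] + m1["O"] * m2["S"]
    norm = 1.0 - k  # Dempster normalization factor
    m = {
        # {S} survives intersection with {S} or with the whole frame.
        "S": (m1["S"] * m2["S"] + m1["S"] * m2["T"] + m1["T"] * m2["S"]) / norm,
        "O": (m1["O"] * m2["O"] + m1["O"] * m2["T"] + m1["T"] * m2["O"]) / norm,
        # Residual mass on the whole frame = remaining ignorance.
        "T": (m1["T"] * m2["T"]) / norm,
    }
    return m, k

if __name__ == "__main__":
    # Two hypothetical evidence sources: one mostly supporting,
    # one mostly opposing the same generated token.
    e1 = {"S": 0.6, "O": 0.1, "T": 0.3}
    e2 = {"S": 0.2, "O": 0.5, "T": 0.3}
    m, conflict = combine(e1, e2)
    print(f"conflict={conflict:.2f}, ignorance={m['T']:.2f}")
```

In this toy setup, a high conflict mass would flag hallucination-like disagreement between evidence sources, while high residual ignorance would flag OOD-like absence of evidence — mirroring the paper's empirical finding, though the actual scoring operates on LVLM output-head features rather than hand-set masses.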
🛡️ Threat Analysis
The paper explicitly evaluates detection of adversarial vulnerabilities in LVLMs — adversarial inputs that cause incorrect or harmful outputs — which is a core ML01 threat. EUQ is evaluated as a defense against adversarial input manipulation of LVLMs.