Defense · 2025

ORCA: Agentic Reasoning For Hallucination and Adversarial Robustness in Vision-Language Models

Chung-En Johnny Yu¹, Hsuan-Chih Chen, Brian Jalaian¹, Nathaniel D. Bastian²



Published on arXiv: 2509.15435

Input Manipulation Attack — OWASP ML Top 10 (ML01)

Prompt Injection — OWASP LLM Top 10 (LLM01)

Key Finding

ORCA improves VLM accuracy under adversarial perturbations by +20.11% on average on POPE without any adversarial training or dedicated defense mechanisms.

ORCA (Observe-Reason-Critique-Act)

Novel technique introduced


Large Vision-Language Models (LVLMs) exhibit strong multimodal capabilities but remain vulnerable to hallucinations from intrinsic errors and adversarial attacks from external exploitations, limiting their reliability in real-world applications. We present ORCA, an agentic reasoning framework that improves the factual accuracy and adversarial robustness of pretrained LVLMs through test-time structured inference reasoning with a suite of small vision models (less than 3B parameters). ORCA operates via an Observe-Reason-Critique-Act loop, querying multiple visual tools with evidential questions, validating cross-model inconsistencies, and refining predictions iteratively without access to model internals or retraining. ORCA also stores intermediate reasoning traces, which supports auditable decision-making. Though designed primarily to mitigate object-level hallucinations, ORCA also exhibits emergent adversarial robustness without requiring adversarial training or defense mechanisms. We evaluate ORCA across three settings: (1) clean images on hallucination benchmarks, (2) adversarially perturbed images without defense, and (3) adversarially perturbed images with defense applied. On the POPE hallucination benchmark, ORCA improves standalone LVLM performance by +3.64% to +40.67% across different subsets. Under adversarial perturbations on POPE, ORCA achieves an average accuracy gain of +20.11% across LVLMs. When combined with defense techniques on adversarially perturbed AMBER images, ORCA further improves standalone LVLM performance, with gains ranging from +1.20% to +48.00% across evaluation metrics. These results demonstrate that ORCA offers a promising path toward building more reliable and robust multimodal systems.


Key Contributions

  • ORCA agentic framework using an Observe-Reason-Critique-Act loop that queries multiple small vision models (<3B params) to validate and refine VLM predictions at test time without accessing model internals or retraining
  • Demonstrates emergent adversarial robustness as a side effect of multi-model cross-validation reasoning, achieving +20.11% average accuracy gain on adversarially perturbed POPE inputs
  • Supports auditable decision-making via stored intermediate reasoning traces, improving hallucination performance by +3.64% to +40.67% on clean POPE benchmark
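The Observe-Reason-Critique-Act loop can be sketched as a plain Python skeleton. Everything below (function names, the majority-vote critique, string-valued answers) is a hypothetical simplification for illustration, not the paper's implementation:

```python
# Minimal sketch of an ORCA-style Observe-Reason-Critique-Act loop.
# All model/tool callables are hypothetical stand-ins for an LVLM and a
# suite of small (<3B-parameter) vision tools.

def orca_loop(image, question, lvlm, tools, max_iters=3):
    """Iteratively refine an LVLM answer by cross-checking small vision tools."""
    trace = []                                 # auditable reasoning trace
    answer = lvlm(image, question)             # Observe: initial LVLM prediction
    for step in range(max_iters):
        # Reason: pose the evidential question to each small vision tool
        evidence = {name: tool(image, question) for name, tool in tools.items()}
        trace.append({"step": step, "answer": answer, "evidence": evidence})
        # Critique: check whether the tools' consensus agrees with the answer
        votes = list(evidence.values())
        consensus = max(set(votes), key=votes.count)
        if consensus == answer:                # Act: accept on agreement
            break
        answer = consensus                     # otherwise refine and iterate
    return answer, trace
```

With a stub LVLM that hallucinates "yes" and two of three tools answering "no", the loop overrides the LVLM on the first pass and terminates once the refined answer matches the tool consensus; the returned trace records each step's answer and evidence.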

🛡️ Threat Analysis

Input Manipulation Attack

ORCA defends against adversarial visual perturbations, i.e., inference-time input manipulation attacks that cause VLMs to produce incorrect outputs. The paper evaluates directly on adversarially perturbed images and reports a +20.11% average accuracy gain under perturbation, positioning ORCA as a defense against visual adversarial input manipulation.
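The attack surface here is the image itself: a small, bounded pixel perturbation flips a model's answer at inference time. A minimal numpy sketch of an FGSM-style perturbation on a toy logistic classifier shows the mechanism (weights and data are synthetic stand-ins; real attacks on VLMs target the vision encoder, not a linear model):

```python
import numpy as np

# Illustrative FGSM-style perturbation (Goodfellow et al., 2015) against a
# toy logistic "image" classifier p = sigmoid(w.x + b). Synthetic example
# only; not the ORCA paper's threat model implementation.

def fgsm_perturb(x, w, b, y, eps):
    """One FGSM step: shift each pixel by eps in the direction that
    increases the cross-entropy loss for the true label y."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))         # confidence for class 1
    grad_x = (p - y) * w                           # d(loss)/dx for logistic loss
    return np.clip(x + eps * np.sign(grad_x), 0.0, 1.0)  # keep pixels in [0, 1]

w = np.array([0.5, -1.0, 2.0, -0.5])               # fixed toy weights
b = 0.0
x = np.full(4, 0.5)                                # "clean image" of 4 pixels
y = 1.0                                            # clean score w @ x = 0.5 > 0
x_adv = fgsm_perturb(x, w, b, y, eps=0.3)
clean_pred = float(w @ x + b > 0)                  # 1.0
adv_pred = float(w @ x_adv + b > 0)                # 0.0: prediction flipped
```

Each pixel moves by at most eps = 0.3, yet the prediction flips; ORCA's cross-model critique counters exactly this failure mode at test time, without adversarial training.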


Details

Domains
vision, multimodal, nlp
Model Types
vlm, multimodal
Threat Tags
black_box, inference_time, digital
Datasets
POPE, AMBER
Applications
visual question answering, image captioning, vision-language model reliability