defense 2026

Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG

Inderjeet Singh 1,2, Vikas Pahuja 1,2, Aishvariya Priya Rathina Sabapathy 1,2, Chiara Picardi 1,2, Amit Giloni 1,2, Roman Vainshtein 1,2, Andrés Murillo 1,2, Hisashi Kojima 2, Motoyoshi Sekiya 2, Yuki Unno 2, Junichi Suga 2

0 citations · 36 references · arXiv (Cornell University)


Published on arXiv · 2602.21447

  • Input Manipulation Attack (OWASP ML Top 10 — ML01)
  • Prompt Injection (OWASP LLM Top 10 — LLM01)

Key Finding

Achieves 6.50x average reduction in Attack Success Rate across 43,774 instances spanning five threat surfaces with negligible utility cost

MMA-RAG^T (Modular Trust Agent)

Novel technique introduced


Current stateless defenses for multimodal agentic RAG fail to detect adversarial strategies that distribute malicious semantics across retrieval, planning, and generation components. We formulate this security challenge as a Partially Observable Markov Decision Process (POMDP) in which adversarial intent is a latent variable inferred from noisy multi-stage observations. We introduce MMA-RAG^T, an inference-time control framework governed by a Modular Trust Agent (MTA) that maintains an approximate belief state via structured LLM reasoning. Operating as a model-agnostic overlay, MMA-RAG^T mediates a configurable set of internal checkpoints to enforce stateful defense-in-depth. Extensive evaluation on 43,774 instances demonstrates a 6.50x average reduction factor in Attack Success Rate relative to undefended baselines, with negligible utility cost. Crucially, a factorial ablation validates our theoretical bounds: while statefulness and spatial coverage are individually necessary (26.4 pp and 13.6 pp gains, respectively), stateless multi-point intervention can yield zero marginal benefit under homogeneous filtering when checkpoint detections are perfectly correlated.


Key Contributions

  • POMDP formalization of agentic RAG security where adversarial intent is a latent variable inferred from noisy multi-stage pipeline observations, with theoretical propositions characterizing when stateless multi-point intervention yields zero marginal benefit
  • Modular Trust Agent (MTA) that maintains an approximate belief state via structured LLM reasoning across configurable internal checkpoints, operating as a model-agnostic inference-time overlay requiring no fine-tuning
  • Factorial ablation validating that statefulness (+26.4 pp) and spatial coverage (+13.6 pp) are individually necessary conditions for effective defense, and identifying a semantic resolution limit for tool-flip attacks (1.29x reduction) where text-based belief tracking must be augmented by deterministic governance
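The belief-state idea behind the contributions above can be sketched with a toy Bayesian update. This is an illustrative assumption, not the paper's implementation: the MTA is described as maintaining an *approximate* belief via structured LLM reasoning, whereas here adversarial intent is a binary latent variable and each checkpoint (retrieval, planning, generation) is modeled as a noisy binary detector with hypothetical true/false-positive rates.

```python
# Hypothetical sketch: stateful trust inference as recursive Bayesian updating.
# Latent intent z ∈ {benign, malicious}; each checkpoint emits a noisy detection.

def update_belief(prior_malicious, detected, tpr, fpr):
    """One Bayesian update of P(malicious) from a checkpoint's binary signal.

    tpr = P(detect | malicious), fpr = P(detect | benign): illustrative
    per-checkpoint detector characteristics, not values from the paper.
    """
    if detected:
        num = tpr * prior_malicious
        den = num + fpr * (1.0 - prior_malicious)
    else:
        num = (1.0 - tpr) * prior_malicious
        den = num + (1.0 - fpr) * (1.0 - prior_malicious)
    return num / den

belief = 0.05  # prior probability of adversarial intent
# Weak, individually sub-threshold signals across three checkpoints accumulate
# into a high posterior, which a stateless per-checkpoint filter would miss:
for detected, tpr, fpr in [(True, 0.6, 0.2), (True, 0.6, 0.2), (False, 0.6, 0.2)]:
    belief = update_belief(belief, detected, tpr, fpr)
print(round(belief, 3))  # → 0.191
```

This also makes the zero-marginal-benefit result intuitive: if all checkpoints run the same stateless filter on perfectly correlated signals, each extra checkpoint repeats the same accept/reject decision, while a stateful belief accumulates evidence across stages.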

🛡️ Threat Analysis

Input Manipulation Attack

The threat model explicitly includes adversarial visual inputs to VLMs as a primary attack surface: images encoding embedded directives (citing FigStep and Chameleon) that jailbreak or manipulate VLM outputs. The MTA defense framework is evaluated against these multimodal adversarial input attacks.


Details

Domains
multimodal, nlp
Model Types
llm, vlm, transformer
Threat Tags
black_box, inference_time
Datasets
ART-SafeBench
Applications
multimodal agentic rag, llm agents, retrieval-augmented generation