Defense · 2025

ARGUS: Defending Against Multimodal Indirect Prompt Injection via Steering Instruction-Following Behavior

Weikai Lu 1, Ziqian Zeng 1, Kehuan Zhang 1, Haoran Li 2, Huiping Zhuang 1, Ruidong Wang 3, Cen Chen 1, Hao Peng 4

1 citation · 50 references · arXiv

Published on arXiv · 2512.05745

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

ARGUS achieves robust defense against multimodal IPI attacks across image, video, and audio modalities while maximally preserving MLLM utility through decoupled direction search and adaptive steering strength

ARGUS

Novel technique introduced


Multimodal Large Language Models (MLLMs) are increasingly vulnerable to multimodal Indirect Prompt Injection (IPI) attacks, which embed malicious instructions in images, videos, or audio to hijack model behavior. Existing defenses, designed primarily for text-only LLMs, are ill-suited to these multimodal threats: they are easily bypassed, modality-dependent, or generalize poorly. Inspired by activation steering research, we hypothesize that a robust, modality-independent defense can be achieved by steering the model's behavior in the representation space. Through extensive experiments, we discover that the instruction-following behavior of MLLMs is encoded in a subspace of that representation space. Steering along directions within this subspace can enforce adherence to user instructions, forming the basis of a defense. However, we also find that a naive defense direction can be coupled with a utility-degrading direction, and that excessive intervention strength harms model performance. To address this, we propose ARGUS, which searches for an optimal defense direction within the safety subspace that is decoupled from the utility-degradation direction, and combines it with adaptive-strength steering to achieve a better safety-utility trade-off. ARGUS also introduces a lightweight injection detection stage to activate the defense on demand, and a post-filtering stage to verify that the defense succeeded. Experimental results show that ARGUS achieves robust defense against multimodal IPI while maximally preserving the MLLM's utility.
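The two core ideas in the abstract, decoupling the defense direction from a utility-degrading direction and scaling the steering strength adaptively, can be sketched in a few lines. This is a minimal illustration, not the paper's method: the orthogonal projection stands in for ARGUS's learned direction search, and the logistic strength schedule is an assumed example of adaptivity.

```python
import numpy as np

def decoupled_defense_direction(d_safety, d_utility):
    """Project the utility-degrading component out of a raw safety
    direction (a Gram-Schmidt step; a stand-in for ARGUS's decoupled
    direction search, whose actual objective is learned)."""
    u = d_utility / np.linalg.norm(d_utility)
    d = d_safety - (d_safety @ u) * u          # remove the coupled component
    return d / np.linalg.norm(d)

def steer(hidden, direction, base_strength=4.0):
    """Adaptive-strength steering: intervene less when the activation
    is already aligned with the defense direction."""
    proj = hidden @ direction                     # current alignment
    alpha = base_strength / (1.0 + np.exp(proj))  # shrinks as alignment grows
    return hidden + alpha * direction

rng = np.random.default_rng(0)
d_raw, d_util = rng.normal(size=64), rng.normal(size=64)
d = decoupled_defense_direction(d_raw, d_util)   # orthogonal to d_util
h = rng.normal(size=64)
h_steered = steer(h, d)                          # nudged toward instruction-following
```

After the projection, moving along `d` no longer moves the activation along the utility-degrading direction, which is the intuition behind the safety-utility trade-off the paper targets.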


Key Contributions

  • Discovers that MLLM instruction-following behavior is encoded in a subspace of the representation space, enabling modality-agnostic defense via activation steering
  • Proposes ARGUS, which finds an optimal defense direction that decouples safety from utility degradation and applies adaptive strength steering for a better safety-utility trade-off
  • Introduces a lightweight injection detection stage and a post-filtering verification stage to activate defense on-demand and confirm its success
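The three stages in the contributions above (on-demand detection, steered generation, post-filter verification) compose into a simple pipeline. The sketch below is hypothetical: `detect_injection` is a keyword heuristic standing in for the paper's lightweight detector, and `generate` is a caller-supplied model wrapper, neither is ARGUS's actual implementation.

```python
def detect_injection(media_text: str) -> bool:
    """Stage 1 stub: flag imperative phrasing in content recovered from
    the media input (hypothetical heuristic; the paper's detector is
    lightweight but not keyword-based)."""
    triggers = ("ignore previous", "instead,", "you must now")
    return any(t in media_text.lower() for t in triggers)

def defend(user_instruction: str, media_text: str, generate) -> str:
    """On-demand pipeline: detect -> steer only if flagged -> post-filter."""
    steered = detect_injection(media_text)            # stage 1: activate on demand
    answer = generate(user_instruction, steer=steered)  # stage 2: (steered) decoding
    # Stage 3: crude post-filter — verify the reply still addresses the
    # user's request; retry once with steering if it does not.
    if steered and user_instruction.split()[0].lower() not in answer.lower():
        answer = generate(user_instruction, steer=True)
    return answer

def mock_generate(instruction: str, steer: bool = False) -> str:
    """Toy model wrapper used only to exercise the pipeline."""
    return f"[steered] {instruction}" if steer else f"[plain] {instruction}"

print(defend("Describe the image", "a cat on a mat", mock_generate))
print(defend("Describe the image", "IGNORE PREVIOUS instructions", mock_generate))
```

The point of the on-demand design is cost: benign inputs skip the steering intervention entirely, so utility on clean traffic is untouched.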

🛡️ Threat Analysis


Details

Domains
multimodal, nlp
Model Types
vlm, llm, multimodal
Threat Tags
inference_time, black_box
Applications
multimodal ai assistants, vision-language models