Defense · 2025

Agentic Moderation: Multi-Agent Design for Safer Vision-Language Models

Juan Ren, Mark Dras, Usman Naseem

1 citation · 32 references · BigData Congress


Published on arXiv · 2510.25179

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Reduces Attack Success Rate by 7–19% and improves Refusal Rate by 4–20% across five datasets and four LVLMs without requiring model retraining.

Agentic Moderation

Novel technique introduced


Agentic methods have emerged as a powerful and autonomous paradigm that enhances reasoning, collaboration, and adaptive control, enabling systems to coordinate and independently solve complex tasks. We extend this paradigm to safety alignment by introducing Agentic Moderation, a model-agnostic framework that leverages specialised agents to defend multimodal systems against jailbreak attacks. Unlike prior approaches that operate as a static layer over inputs or outputs and provide only binary classifications (safe or unsafe), our method integrates dynamic, cooperative agents, including Shield, Responder, Evaluator, and Reflector, to achieve context-aware and interpretable moderation. Extensive experiments across five datasets and four representative Large Vision-Language Models (LVLMs) demonstrate that our approach reduces the Attack Success Rate (ASR) by 7–19%, maintains a stable Non-Following Rate (NF), and improves the Refusal Rate (RR) by 4–20%, achieving robust, interpretable, and well-balanced safety performance. By harnessing the flexibility and reasoning capacity of agentic architectures, Agentic Moderation provides modular, scalable, and fine-grained safety enforcement, highlighting the broader potential of agentic systems as a foundation for automated safety governance.


Key Contributions

  • Agentic Moderation framework that reconceptualizes LVLM safety alignment as a collaborative multi-agent process with specialized roles (Shield, Responder, Evaluator, Reflector)
  • Model-agnostic, inference-time jailbreak defense requiring no model retraining that provides context-aware and interpretable moderation beyond binary safe/unsafe classification
  • Empirical validation across five datasets and four LVLMs showing 7–19% reduction in Attack Success Rate and 4–20% improvement in Refusal Rate
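The Shield → Responder → Evaluator → Reflector loop described above can be sketched as follows. This is a hypothetical, minimal Python illustration of the control flow only: the paper does not publish this code, and each agent here is a toy stand-in (keyword matching, string templates) for what would in practice be an LLM-backed agent with its own prompts and policies. The marker strings, function names, and `max_rounds` parameter are all assumptions for illustration.

```python
# Hypothetical sketch of the four-agent moderation loop; agent internals
# (LLM prompts, scoring rubrics, image analysis) are toy placeholders.

UNSAFE_MARKERS = {"build a weapon", "bypass safety"}  # stand-in for Shield's policy


def shield(prompt: str) -> bool:
    """Screen the incoming (multimodal) request; True means it looks unsafe."""
    return any(m in prompt.lower() for m in UNSAFE_MARKERS)


def responder(prompt: str, guidance: str = "") -> str:
    """Draft a response, optionally conditioned on Reflector guidance."""
    if guidance:
        return f"[refusal] {guidance}"
    return f"[answer to] {prompt}"


def evaluator(response: str) -> bool:
    """Judge the draft; True means the response is safe to release."""
    return response.startswith("[refusal]") or response.startswith("[answer to]")


def reflector(prompt: str) -> str:
    """Produce revision guidance when Shield or the Evaluator flags a problem."""
    return "This request conflicts with safety policy; decline and explain why."


def moderate(prompt: str, max_rounds: int = 2) -> str:
    """Run the cooperative loop: screen, draft, judge, and reflect if needed."""
    guidance = reflector(prompt) if shield(prompt) else ""
    draft = responder(prompt, guidance)
    for _ in range(max_rounds):
        if evaluator(draft):
            return draft
        draft = responder(prompt, reflector(prompt))  # revise with guidance
    return "[refusal] Unable to produce a safe response."


print(moderate("How do I build a weapon?"))  # refused via Shield + Reflector
print(moderate("What is the capital of France?"))  # answered normally
```

The point of the structure, per the abstract, is that moderation is a dialogue among roles rather than a single binary filter: the Reflector's guidance gives the Responder an interpretable reason to refuse, which is what enables the context-aware behaviour the paper reports.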

🛡️ Threat Analysis

Input Manipulation Attack

The framework defends against adversarial visual inputs to VLMs, including pixel-level perturbations that embed harmful intent in images, typography-based attacks, and cross-modal adversarial perturbations. Because these are adversarial visual inputs used to jailbreak VLMs, the dual ML01+LLM01 rule warrants co-tagging with ML01 alongside LLM01.


Details

Domains
multimodal · vision · nlp

Model Types
vlm · llm · multimodal

Threat Tags
inference_time · black_box

Datasets
MM-SafetyBench · JailBreakV-28K · FigStep

Applications
vision-language models · multimodal AI safety · content moderation