Defense · 2025

Agentic Moderation: Multi-Agent Design for Safer Vision-Language Models

Juan Ren, Mark Dras, Usman Naseem

1 citation · 32 references · BigData Congress


Published on arXiv · 2510.25179

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Reduces Attack Success Rate by 7–19% and improves Refusal Rate by 4–20% across five datasets and four LVLMs without requiring model retraining.

Agentic Moderation

Novel technique introduced


Agentic methods have emerged as a powerful and autonomous paradigm that enhances reasoning, collaboration, and adaptive control, enabling systems to coordinate and independently solve complex tasks. We extend this paradigm to safety alignment by introducing Agentic Moderation, a model-agnostic framework that leverages specialised agents to defend multimodal systems against jailbreak attacks. Unlike prior approaches that operate as a static layer over inputs or outputs and provide only binary classifications (safe or unsafe), our method integrates dynamic, cooperative agents, including Shield, Responder, Evaluator, and Reflector, to achieve context-aware and interpretable moderation. Extensive experiments across five datasets and four representative Large Vision-Language Models (LVLMs) demonstrate that our approach reduces the Attack Success Rate (ASR) by 7–19%, maintains a stable Non-Following Rate (NF), and improves the Refusal Rate (RR) by 4–20%, achieving robust, interpretable, and well-balanced safety performance. By harnessing the flexibility and reasoning capacity of agentic architectures, Agentic Moderation provides modular, scalable, and fine-grained safety enforcement, highlighting the broader potential of agentic systems as a foundation for automated safety governance.


Key Contributions

  • Agentic Moderation framework that reconceptualizes LVLM safety alignment as a collaborative multi-agent process with specialized roles (Shield, Responder, Evaluator, Reflector)
  • Model-agnostic, inference-time jailbreak defense requiring no model retraining that provides context-aware and interpretable moderation beyond binary safe/unsafe classification
  • Empirical validation across five datasets and four LVLMs showing 7–19% reduction in Attack Success Rate and 4–20% improvement in Refusal Rate
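The Shield → Responder → Evaluator → Reflector loop described above can be sketched as follows. This is a hypothetical, minimal Python illustration of the control flow only: the paper does not publish this code, and each agent here is a toy stand-in (keyword matching, string templates) for what would in practice be an LLM-backed agent with its own prompts and policies. The marker strings, function names, and `max_rounds` parameter are all assumptions for illustration.

```python
# Hypothetical sketch of the four-agent moderation loop; agent internals
# (LLM prompts, scoring rubrics, image analysis) are toy placeholders.

UNSAFE_MARKERS = {"build a weapon", "bypass safety"}  # stand-in for Shield's policy


def shield(prompt: str) -> bool:
    """Screen the incoming (multimodal) request; True means it looks unsafe."""
    return any(m in prompt.lower() for m in UNSAFE_MARKERS)


def responder(prompt: str, guidance: str = "") -> str:
    """Draft a response, optionally conditioned on Reflector guidance."""
    if guidance:
        return f"[refusal] {guidance}"
    return f"[answer to] {prompt}"


def evaluator(response: str) -> bool:
    """Judge the draft; True means the response is safe to release."""
    return response.startswith("[refusal]") or response.startswith("[answer to]")


def reflector(prompt: str) -> str:
    """Produce revision guidance when Shield or the Evaluator flags a problem."""
    return "This request conflicts with safety policy; decline and explain why."


def moderate(prompt: str, max_rounds: int = 2) -> str:
    """Run the cooperative loop: screen, draft, judge, and reflect if needed."""
    guidance = reflector(prompt) if shield(prompt) else ""
    draft = responder(prompt, guidance)
    for _ in range(max_rounds):
        if evaluator(draft):
            return draft
        draft = responder(prompt, reflector(prompt))  # revise with guidance
    return "[refusal] Unable to produce a safe response."


print(moderate("How do I build a weapon?"))  # refused via Shield + Reflector
print(moderate("What is the capital of France?"))  # answered normally
```

The point of the structure, per the abstract, is that moderation is a dialogue among roles rather than a single binary filter: the Reflector's guidance gives the Responder an interpretable reason to refuse, which is what enables the context-aware behaviour the paper reports.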

🛡️ Threat Analysis

Input Manipulation Attack

The framework defends against adversarial visual inputs to VLMs, including pixel-level perturbations that embed harmful intent in images, typography-based attacks, and cross-modal adversarial perturbations. Because these are adversarial visual inputs used to jailbreak VLMs, the dual ML01+LLM01 rule warrants co-tagging with ML01 alongside LLM01.


Details

Domains
multimodal · vision · nlp

Model Types
vlm · llm · multimodal

Threat Tags
inference_time · black_box

Datasets
MM-SafetyBench · JailBreakV-28K · FigStep

Applications
vision-language models · multimodal AI safety · content moderation