tool 2026

OpenRT: An Open-Source Red Teaming Framework for Multimodal LLMs

Xin Wang , Yunhao Chen , Juncheng Li , Yixu Wang , Yang Yao , Tianle Gu , Jie Li , Yan Teng , Yingchun Wang , Xia Hu

Shanghai Artificial Intelligence Laboratory

4 citations · 1 influential · arXiv

Published on arXiv

2601.01592

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Even state-of-the-art frontier MLLMs exhibit average Attack Success Rates up to 49.14%, with advanced multi-turn and multi-agent strategies like EvoSynth and X-Teaming achieving >90% ASR against models otherwise resistant to static jailbreaks.
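
For reference, Attack Success Rate (ASR) is conventionally the fraction of attack attempts that a judge marks as successful. A minimal sketch follows, assuming boolean judge verdicts and an unweighted mean across attacks; the paper may aggregate differently. The function names and sample verdicts are illustrative and are not taken from the paper.

```python
from statistics import mean


def attack_success_rate(verdicts):
    """Fraction of attempts judged successful (True = jailbreak succeeded)."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(verdicts) / len(verdicts)


def average_asr(verdicts_by_attack):
    """Unweighted mean of per-attack ASRs for one target model (assumed aggregation)."""
    return mean(attack_success_rate(v) for v in verdicts_by_attack.values())


# Made-up verdicts for two hypothetical attacks against one model.
example = {
    "static_attack": [False, False, False, True],    # per-attack ASR 0.25
    "multi_turn_attack": [True, True, True, False],  # per-attack ASR 0.75
}
print(f"{average_asr(example):.2%}")
```

An average of this kind can hide a wide spread: a model that resists static attacks may still score high on multi-turn ones, which is the pattern the key finding describes.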

OpenRT

Novel technique introduced

The rapid integration of Multimodal Large Language Models (MLLMs) into critical applications is increasingly hindered by persistent safety vulnerabilities. However, existing red-teaming benchmarks are often fragmented, limited to single-turn text interactions, and lack the scalability required for systematic evaluation. To address this, we introduce OpenRT, a unified, modular, and high-throughput red-teaming framework designed for comprehensive MLLM safety evaluation. At its core, OpenRT introduces an adversarial kernel that enables modular separation across five critical dimensions: model integration, dataset management, attack strategies, judging methods, and evaluation metrics. By standardizing attack interfaces, it decouples adversarial logic from a high-throughput asynchronous runtime, enabling systematic scaling across diverse models. Our framework integrates 37 diverse attack methodologies, spanning white-box gradients, multimodal perturbations, and sophisticated multi-agent evolutionary strategies. Through an extensive empirical study on 20 advanced models (including GPT-5.2, Claude 4.5, and Gemini 3 Pro), we expose critical safety gaps: even frontier models fail to generalize across attack paradigms, with leading models exhibiting average Attack Success Rates as high as 49.14%. Notably, our findings reveal that reasoning models do not inherently possess superior robustness against complex, multi-turn jailbreaks. By open-sourcing OpenRT, we provide a sustainable, extensible, and continuously maintained infrastructure that accelerates the development and standardization of AI safety.

Key Contributions

OpenRT: a modular, high-throughput red-teaming framework with an adversarial kernel separating model integration, datasets, attack strategies, judging, and metrics into five decoupled dimensions
Integration of 37 diverse attack methodologies spanning white-box gradients, multimodal perturbations, and multi-agent evolutionary strategies within a unified async runtime
Large-scale empirical study on 20 frontier MLLMs (GPT-5.2, Claude 4.5, Gemini 3 Pro, etc.) revealing average ASRs as high as 49.14% and that reasoning models offer no inherent robustness advantage against multi-turn jailbreaks
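
The separation described above can be pictured with a small sketch: a target model, an attack, a judge, a dataset, and a metric are passed independently into an asynchronous runner. Every name here (`Model`, `Judge`, `Attack`, `PrefixAttack`, `evaluate`, `echo_model`, `keyword_judge`) is hypothetical and is not OpenRT's actual API; the target is a harmless stub, not a real model.

```python
import asyncio
from typing import Awaitable, Callable, Protocol, Sequence

# Hypothetical interfaces for illustration only; not OpenRT's real API.
Model = Callable[[str], Awaitable[str]]  # prompt -> response
Judge = Callable[[str, str], bool]       # (goal, response) -> success?


class Attack(Protocol):
    async def run(self, model: Model, goal: str) -> str:
        """Return the final model response obtained for `goal`."""
        ...


class PrefixAttack:
    """Toy stand-in for an attack strategy: prepends a fixed string."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix

    async def run(self, model: Model, goal: str) -> str:
        return await model(f"{self.prefix} {goal}")


async def evaluate(model: Model, attack: Attack, goals: Sequence[str],
                   judge: Judge, concurrency: int = 8) -> float:
    """Run one attack over a dataset concurrently and return its ASR."""
    if not goals:
        raise ValueError("dataset is empty")
    limit = asyncio.Semaphore(concurrency)  # caps in-flight model calls

    async def one(goal: str) -> bool:
        async with limit:
            response = await attack.run(model, goal)
        return judge(goal, response)

    verdicts = await asyncio.gather(*(one(g) for g in goals))
    return sum(verdicts) / len(verdicts)


async def echo_model(prompt: str) -> str:
    """Stub target that refuses unless the prompt contains 'please'."""
    await asyncio.sleep(0)
    return "OK" if "please" in prompt else "REFUSED"


def keyword_judge(goal: str, response: str) -> bool:
    """Stub judge: any non-refusal counts as a successful attack."""
    return response != "REFUSED"


goals = ["goal-a", "goal-b", "goal-c"]
asr = asyncio.run(evaluate(echo_model, PrefixAttack("please"), goals, keyword_judge))
print(asr)
```

Because the runner only depends on the interfaces, swapping in a different attack, judge, or target does not require touching the concurrency logic, which is the kind of decoupling the contributions list describes.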

🛡️ Threat Analysis

Input Manipulation Attack

The framework explicitly integrates white-box gradient-based attacks and multimodal adversarial perturbations of visual inputs against VLMs, both of which are canonical ML01 input manipulation attacks at inference time.

Details

Domains

nlp, multimodal, vision

Model Types

llm, vlm, multimodal

Threat Tags

white_box, black_box, inference_time

Applications

multimodal llm safety evaluation, mllm red-teaming, jailbreak benchmarking

Similar Papers

DefenSee: Dissecting Threat from Sight and Text -- A Multi-View Defensive Pipeline for Multi-modal Jailbreaks

Reimagining Safety Alignment with An Image

GuardAlign: Test-time Safety Alignment in Multimodal Large Language Models

Agentic Moderation: Multi-Agent Design for Safer Vision-Language Models

Risk-adaptive Activation Steering for Safe Multimodal Large Language Models

CoDefend: Cross-Modal Collaborative Defense via Diffusion Purification and Prompt Optimization

FORCE: Transferable Visual Jailbreaking Attacks via Feature Over-Reliance CorrEction

VISOR++: Universal Visual Inputs based Steering for Large Vision Language Models