A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5
Xingjun Ma 1,2, Yixu Wang 1, Hengyuan Xu 1, Yutao Wu 3, Yifan Ding 1, Yunhan Zhao 1, Zilong Wang 1, Jiabin Hua 1, Ming Wen 1,2, Jianan Liu 1,2, Ranjie Duan, Yifeng Gao 1, Yingshui Tan, Yunhao Chen 1, Hui Xue, Xin Wang 1, Wei Cheng, Jingjing Chen 1, Zuxuan Wu 1, Bo Li 4, Yu-Gang Jiang 1
Published on arXiv (arXiv:2601.10527)
Key Finding
Under adversarial testing, all six frontier models fall to worst-case safety rates below 6% despite strong standard benchmark performance, revealing a critical gap between benchmark safety and real-world safety across language, vision-language, and image generation modalities.
The rapid evolution of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) has driven major gains in reasoning, perception, and generation across language and vision, yet whether these advances translate into comparable improvements in safety remains unclear, partly due to fragmented evaluations that focus on isolated modalities or threat models. In this report, we present an integrated safety evaluation of six frontier models--GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5--assessing each across language, vision-language, and image generation using a unified protocol that combines benchmark, adversarial, multilingual, and compliance evaluations. By aggregating results into safety leaderboards and model profiles, we reveal a highly uneven safety landscape: while GPT-5.2 demonstrates consistently strong and balanced performance, other models exhibit clear trade-offs across benchmark safety, adversarial robustness, multilingual generalization, and regulatory compliance. Despite strong results under standard benchmarks, all models remain highly vulnerable under adversarial testing, with worst-case safety rates dropping below 6%. Text-to-image models show slightly stronger alignment in regulated visual risk categories, yet remain fragile when faced with adversarial or semantically ambiguous prompts. Overall, these findings highlight that safety in frontier models is inherently multidimensional--shaped by modality, language, and evaluation design--underscoring the need for standardized, holistic safety assessments to better reflect real-world risk and guide responsible deployment.
Key Contributions
- Unified multi-dimensional safety evaluation protocol combining benchmark, adversarial, multilingual, and compliance evaluations across text, vision-language, and text-to-image modalities
- Safety leaderboards and per-model profiles exposing highly uneven safety trade-offs among GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5
- Empirical finding that all six frontier models fail catastrophically under adversarial conditions with worst-case safety rates below 6%, despite strong standard benchmark scores
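The leaderboard construction described above can be illustrated with a minimal sketch: per-dimension safety rates (the fraction of safe responses) are averaged into a model profile, while the worst-case score is the minimum rate observed across all runs. The function names, dimension labels, and numbers below are hypothetical placeholders, not the report's actual protocol or data.

```python
def safety_profile(results: dict[str, list[float]]) -> dict[str, float]:
    """Average the per-run safety rates within each evaluation dimension."""
    return {dim: sum(rates) / len(rates) for dim, rates in results.items()}

def worst_case_rate(results: dict[str, list[float]]) -> float:
    """Worst-case safety rate: the minimum across all dimensions and runs."""
    return min(r for rates in results.values() for r in rates)

# Illustrative per-run safety rates (fractions) for a single model.
example = {
    "benchmark":    [0.97, 0.95],
    "adversarial":  [0.31, 0.05],  # e.g. jailbreak-style attack suites
    "multilingual": [0.88, 0.82],
    "compliance":   [0.93],
}

profile = safety_profile(example)
print(profile["benchmark"])      # → 0.96
print(worst_case_rate(example))  # → 0.05
```

Averaging hides the failure mode the report highlights: the benchmark average stays high while a single adversarial run drags the worst-case rate far below it, which is why the leaderboards report both views.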