A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5
Xingjun Ma 1,2, Yixu Wang 1, Hengyuan Xu 1, Yutao Wu 3, Yifan Ding 1, Yunhan Zhao 1, Zilong Wang 1, Jiabin Hua 1, Ming Wen 1,2, Jianan Liu 1,2, Ranjie Duan, Yifeng Gao 1, Yingshui Tan, Yunhao Chen 1, Hui Xue, Xin Wang 1, Wei Cheng, Jingjing Chen 1, Zuxuan Wu 1, Bo Li 4, Yu-Gang Jiang 1
Published on arXiv (arXiv:2601.10527)
Key Finding
Under adversarial testing, all six frontier models fall to worst-case safety rates below 6% despite strong standard benchmark performance, revealing a critical gap between benchmark safety and real-world safety across language, vision-language, and image generation modalities.
The rapid evolution of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) has driven major gains in reasoning, perception, and generation across language and vision, yet whether these advances translate into comparable improvements in safety remains unclear, partly due to fragmented evaluations that focus on isolated modalities or threat models. In this report, we present an integrated safety evaluation of six frontier models--GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5--assessing each across language, vision-language, and image generation using a unified protocol that combines benchmark, adversarial, multilingual, and compliance evaluations. By aggregating results into safety leaderboards and model profiles, we reveal a highly uneven safety landscape: while GPT-5.2 demonstrates consistently strong and balanced performance, other models exhibit clear trade-offs across benchmark safety, adversarial robustness, multilingual generalization, and regulatory compliance. Despite strong results under standard benchmarks, all models remain highly vulnerable under adversarial testing, with worst-case safety rates dropping below 6%. Text-to-image models show slightly stronger alignment in regulated visual risk categories, yet remain fragile when faced with adversarial or semantically ambiguous prompts. Overall, these findings highlight that safety in frontier models is inherently multidimensional--shaped by modality, language, and evaluation design--underscoring the need for standardized, holistic safety assessments to better reflect real-world risk and guide responsible deployment.
Key Contributions
- Unified multi-dimensional safety evaluation protocol combining benchmark, adversarial, multilingual, and compliance evaluations across text, vision-language, and text-to-image modalities
- Safety leaderboards and per-model profiles exposing highly uneven safety trade-offs among GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5
- Empirical finding that all six frontier models fail catastrophically under adversarial conditions with worst-case safety rates below 6%, despite strong standard benchmark scores
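The leaderboard construction described above can be illustrated with a minimal sketch: per-dimension safety rates (the fraction of safe responses) are averaged into a model profile, while the worst-case score is the minimum rate observed across all runs. The function names, dimension labels, and numbers below are hypothetical placeholders, not the report's actual protocol or data.

```python
def safety_profile(results: dict[str, list[float]]) -> dict[str, float]:
    """Average the per-run safety rates within each evaluation dimension."""
    return {dim: sum(rates) / len(rates) for dim, rates in results.items()}

def worst_case_rate(results: dict[str, list[float]]) -> float:
    """Worst-case safety rate: the minimum across all dimensions and runs."""
    return min(r for rates in results.values() for r in rates)

# Illustrative per-run safety rates (fractions) for a single model.
example = {
    "benchmark":    [0.97, 0.95],
    "adversarial":  [0.31, 0.05],  # e.g. jailbreak-style attack suites
    "multilingual": [0.88, 0.82],
    "compliance":   [0.93],
}

profile = safety_profile(example)
print(profile["benchmark"])      # → 0.96
print(worst_case_rate(example))  # → 0.05
```

Averaging hides the failure mode the report highlights: the benchmark average stays high while a single adversarial run drags the worst-case rate far below it, which is why the leaderboards report both views.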