
Evaluating Adversarial Vulnerabilities in Modern Large Language Models

Tom Perel

0 citations · 8 references · arXiv


Published on arXiv · 2511.17666

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Cross-bypass attacks, in which one LLM generates adversarial prompts to jailbreak the other, were particularly effective, and Gemini 2.5 Flash and GPT-4o mini showed a meaningful disparity in jailbreak susceptibility across attack categories.

Self-bypass / Cross-bypass

Novel technique introduced


The recent boom and rapid integration of Large Language Models (LLMs) into a wide range of applications warrants a deeper understanding of their security and safety vulnerabilities. This paper presents a comparative analysis of susceptibility to jailbreak attacks for two leading publicly available LLMs: Google's Gemini 2.5 Flash and OpenAI's GPT-4 (specifically the GPT-4o mini model accessible in the free tier). The research utilized two main bypass strategies: 'self-bypass', where models were prompted to circumvent their own safety protocols, and 'cross-bypass', where one model generated adversarial prompts to exploit vulnerabilities in the other. Four attack methods were employed (direct injection, role-playing, context manipulation, and obfuscation) to elicit five distinct categories of unsafe content: hate speech, illegal activities, malicious code, dangerous content, and misinformation. Attack success was determined by the generation of disallowed content, with each successful jailbreak assigned a severity score. The findings indicate a disparity in jailbreak susceptibility between Gemini 2.5 Flash and GPT-4o mini, suggesting variations in their safety implementations or architectural design. Cross-bypass attacks were particularly effective, suggesting that exploitable vulnerabilities persist in the underlying transformer architecture. This research contributes a scalable framework for automated AI red-teaming and provides data-driven insights into the current state of LLM safety, underscoring the complex challenge of balancing model capabilities with robust safety mechanisms.
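The cross-bypass procedure described above can be sketched as a simple evaluation loop. This is a minimal illustration, not the paper's implementation: the `attacker_generate`, `target_respond`, and `judge` functions are hypothetical stubs standing in for the real model API calls and the paper's (unpublished) prompt templates and scoring rubric.

```python
import random

# Hypothetical stand-ins for calls to the attacker model, the target
# model, and the response judge; the paper's actual prompts, endpoints,
# and severity rubric are not reproduced here.
def attacker_generate(target_name, method, category):
    """Attacker model crafts an adversarial prompt for the target."""
    return f"[{method}] prompt aimed at eliciting {category} from {target_name}"

def target_respond(prompt):
    """Target model answers; this stub refuses at random."""
    return "REFUSED" if random.random() < 0.5 else "UNSAFE OUTPUT"

def judge(response):
    """Severity score: 0 = refused, 1-3 = increasingly harmful output."""
    return 0 if response == "REFUSED" else random.randint(1, 3)

# The four attack methods and five content categories from the paper.
METHODS = ["direct injection", "role-playing",
           "context manipulation", "obfuscation"]
CATEGORIES = ["hate speech", "illegal activities", "malicious code",
              "dangerous content", "misinformation"]

def cross_bypass_run(target_name, trials=5):
    """Run every method x category cell and record jailbreak rates."""
    results = {}
    for method in METHODS:
        for category in CATEGORIES:
            scores = []
            for _ in range(trials):
                prompt = attacker_generate(target_name, method, category)
                scores.append(judge(target_respond(prompt)))
            # success rate = fraction of trials yielding disallowed content
            results[(method, category)] = sum(s > 0 for s in scores) / trials
    return results

rates = cross_bypass_run("gemini-2.5-flash")
print(len(rates))  # 20 cells: 4 methods x 5 categories
```

Swapping which model plays attacker and which plays target gives both directions of the cross-bypass comparison; setting attacker and target to the same model recovers the self-bypass case.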


Key Contributions

  • Comparative empirical evaluation of jailbreak susceptibility between Gemini 2.5 Flash and GPT-4o mini under identical testing conditions
  • Introduction of 'self-bypass' (model attacks its own safety) and 'cross-bypass' (model generates adversarial prompts targeting a competing model) paradigms as scalable red-teaming methodologies
  • Detailed vulnerability mapping of four attack vectors (direct injection, role-playing, context manipulation, obfuscation) across five harmful content categories with severity scoring
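The per-cell vulnerability mapping lends itself to a simple disparity summary between the two models. The sketch below uses illustrative placeholder grids (uniform rates, not the paper's measured data) to show one way the reported disparity could be quantified per attack method.

```python
# Illustrative placeholder success-rate grids over the paper's
# 4 attack methods x 5 content categories; values are NOT measured data.
METHODS = ["direct injection", "role-playing",
           "context manipulation", "obfuscation"]
CATEGORIES = ["hate speech", "illegal activities", "malicious code",
              "dangerous content", "misinformation"]

gemini = {(m, c): 0.20 for m in METHODS for c in CATEGORIES}
gpt4o_mini = {(m, c): 0.35 for m in METHODS for c in CATEGORIES}

def mean_rate(grid):
    """Overall jailbreak success rate across all cells."""
    return sum(grid.values()) / len(grid)

def per_method_disparity(a, b):
    """Absolute difference in mean success rate for each attack method."""
    out = {}
    for m in METHODS:
        ra = sum(a[(m, c)] for c in CATEGORIES) / len(CATEGORIES)
        rb = sum(b[(m, c)] for c in CATEGORIES) / len(CATEGORIES)
        out[m] = abs(ra - rb)
    return out

print(round(mean_rate(gemini), 2))
print(per_method_disparity(gemini, gpt4o_mini))
```

With real measurements in the grids, the per-method disparities would identify which attack vectors separate the two models' safety implementations most sharply.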

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time
Applications
large language model safety, ai red-teaming