Evaluating Adversarial Vulnerabilities in Modern Large Language Models
Published on arXiv
2511.17666
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Cross-bypass attacks — where one LLM generates adversarial prompts to jailbreak the other — were particularly effective, and Gemini 2.5 Flash and GPT-4o mini showed meaningful disparity in jailbreak susceptibility across attack categories.
Self-bypass / Cross-bypass
Novel technique introduced
The recent boom and rapid integration of Large Language Models (LLMs) into a wide range of applications warrants a deeper understanding of their security and safety vulnerabilities. This paper presents a comparative analysis of the susceptibility to jailbreak attacks for two leading publicly available LLMs, Google's Gemini 2.5 Flash and OpenAI's GPT-4 (specifically the GPT-4o mini model accessible in the free tier). The research utilized two main bypass strategies: 'self-bypass', where models were prompted to circumvent their own safety protocols, and 'cross-bypass', where one model generated adversarial prompts to exploit vulnerabilities in the other. Four attack methods were employed - direct injection, role-playing, context manipulation, and obfuscation - to generate five distinct categories of unsafe content: hate speech, illegal activities, malicious code, dangerous content, and misinformation. The success of the attack was determined by the generation of disallowed content, with successful jailbreaks assigned a severity score. The findings indicate a disparity in jailbreak susceptibility between 2.5 Flash and GPT-4, suggesting variations in their safety implementations or architectural design. Cross-bypass attacks were particularly effective, indicating that an ample amount of vulnerabilities exist in the underlying transformer architecture. This research contributes a scalable framework for automated AI red-teaming and provides data-driven insights into the current state of LLM safety, underscoring the complex challenge of balancing model capabilities with robust safety mechanisms.
Key Contributions
- Comparative empirical evaluation of jailbreak susceptibility between Gemini 2.5 Flash and GPT-4o mini under identical testing conditions
- Introduction of 'self-bypass' (model attacks its own safety) and 'cross-bypass' (model generates adversarial prompts targeting a competing model) paradigms as scalable red-teaming methodologies
- Detailed vulnerability mapping of four attack vectors (direct injection, role-playing, context manipulation, obfuscation) across five harmful content categories with severity scoring