Jailbreak-Zero: A Path to Pareto Optimal Red Teaming for Large Language Models
Kai Hu 1,2, Abhinav Aggarwal 1, Mehran Khodabandeh 1, David Zhang 1, Eric Hsin 1, Li Chen 1, Ankit Jain 1, Matt Fredrikson 2, Akash Bharadwaj 1
Published on arXiv (2601.03265)
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Jailbreak-Zero achieves significantly higher attack success rates against GPT-4o and Claude 3.5 than existing techniques while maintaining human-readable and fidelity-preserving adversarial prompts.
This paper introduces Jailbreak-Zero, a red teaming methodology that shifts Large Language Model (LLM) safety evaluation from a constrained, example-based approach to a more expansive and effective policy-based framework. An attack LLM generates a high volume of diverse adversarial prompts and is then fine-tuned on a preference dataset, driving Jailbreak-Zero toward Pareto optimality across three crucial objectives: policy coverage, attack strategy diversity, and prompt fidelity to real user inputs. Empirically, the method achieves significantly higher attack success rates against both open-source and proprietary models, including GPT-4o and Claude 3.5, than existing state-of-the-art techniques. Crucially, Jailbreak-Zero does so while producing human-readable, effective adversarial prompts with minimal human intervention, offering a more scalable and comprehensive way to identify and mitigate the safety vulnerabilities of LLMs.
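The Pareto-optimality claim can be made concrete with a small sketch. Assuming each candidate adversarial prompt is scored on the three stated objectives (policy coverage, strategy diversity, fidelity to real user inputs), the non-dominated set is the subset no other candidate beats on every axis. The scoring fields and example values below are illustrative assumptions, not the paper's actual metrics.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    prompt: str
    coverage: float   # hypothetical: fraction of target policies exercised
    diversity: float  # hypothetical: distance from already-kept strategies
    fidelity: float   # hypothetical: similarity to real user inputs


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if a is >= b on every objective and > on at least one."""
    ge = (a.coverage >= b.coverage and a.diversity >= b.diversity
          and a.fidelity >= b.fidelity)
    gt = (a.coverage > b.coverage or a.diversity > b.diversity
          or a.fidelity > b.fidelity)
    return ge and gt


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that no other candidate dominates."""
    return [c for c in cands
            if not any(dominates(o, c) for o in cands if o is not c)]


candidates = [
    Candidate("A", coverage=0.9, diversity=0.2, fidelity=0.7),
    Candidate("B", coverage=0.5, diversity=0.8, fidelity=0.6),
    Candidate("C", coverage=0.4, diversity=0.1, fidelity=0.5),  # dominated by B
]
front = pareto_front(candidates)  # A and B survive; neither dominates the other
```

Here "A" trades diversity for coverage and "B" the reverse, so both sit on the front, while "C" is strictly worse than "B" and is dropped.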
Key Contributions
- Policy-based red teaming framework that fine-tunes an attack LLM with preference data to generate high-volume, diverse adversarial prompts, achieving Pareto optimality across policy coverage, diversity, and prompt fidelity.
- Demonstrates significantly higher attack success rates against proprietary models (GPT-4o, Claude 3.5) compared to existing state-of-the-art red teaming methods.
- Produces human-readable adversarial prompts with minimal human intervention, enabling scalable, automated safety vulnerability discovery.
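The preference fine-tuning step in the first contribution can be sketched as a DPO-style pair-construction routine: for each seed harmful instruction, pair a successful attack prompt ("chosen") with a failed one ("rejected"). The record fields and the best/worst selection rule are assumptions for illustration, not the paper's exact recipe.

```python
from typing import Dict, List


def build_preference_pairs(attempts: List[Dict]) -> List[Dict]:
    """Build (chosen, rejected) pairs per seed instruction, DPO-style.

    Each attempt is a hypothetical record:
      {"seed": str, "prompt": str, "success": bool, "score": float}
    where "score" stands in for a combined coverage/diversity/fidelity value.
    """
    by_seed: Dict[str, List[Dict]] = {}
    for a in attempts:
        by_seed.setdefault(a["seed"], []).append(a)

    pairs = []
    for seed, group in by_seed.items():
        wins = [a for a in group if a["success"]]
        losses = [a for a in group if not a["success"]]
        if wins and losses:
            best = max(wins, key=lambda a: a["score"])
            worst = min(losses, key=lambda a: a["score"])
            pairs.append({
                "prompt": seed,             # the seed instruction being attacked
                "chosen": best["prompt"],   # highest-scoring successful attack
                "rejected": worst["prompt"],  # lowest-scoring failed attack
            })
    return pairs


attempts = [
    {"seed": "s1", "prompt": "p1", "success": True, "score": 0.9},
    {"seed": "s1", "prompt": "p2", "success": False, "score": 0.3},
    {"seed": "s2", "prompt": "p3", "success": False, "score": 0.1},  # no win: skipped
]
pairs = build_preference_pairs(attempts)
```

Seeds with only wins or only losses yield no pair, which keeps the dataset strictly contrastive; the resulting pairs could then feed any standard preference-optimization trainer.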