attack 2026

Boundary Point Jailbreaking of Black-Box LLMs

Xander Davies¹,², Giorgi Giglemiani¹, Edmund Lau¹, Eric Winsor¹, Geoffrey Irving¹, Yarin Gal¹,²

¹ UK AI Security Institute

² University of Oxford

0 citations · 45 references · arXiv (Cornell University)

Published on arXiv

2602.15001

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

The authors believe BPJ is the first fully automated attack to produce universal jailbreaks against Anthropic's Constitutional Classifiers, and the first automated attack to succeed against GPT-5's input classifier without relying on human-crafted seed attacks. The results were verified by Anthropic and OpenAI respectively.

Novel technique introduced

Boundary Point Jailbreaking (BPJ)

Frontier LLMs are safeguarded against attempts to extract harmful information via adversarial prompts known as "jailbreaks". Recently, defenders have developed classifier-based systems that have survived thousands of hours of human red teaming. We introduce Boundary Point Jailbreaking (BPJ), a new class of automated jailbreak attacks that evade the strongest industry-deployed safeguards. Unlike previous attacks that rely on white/grey-box assumptions (such as classifier scores or gradients) or libraries of existing jailbreaks, BPJ is fully black-box and uses only a single bit of information per query: whether or not the classifier flags the interaction. To achieve this, BPJ addresses the core difficulty in optimising attacks against robust real-world defences: evaluating whether a proposed modification to an attack is an improvement. Instead of directly trying to learn an attack for a target harmful string, BPJ converts the string into a curriculum of intermediate attack targets and then actively selects evaluation points that best detect small changes in attack strength ("boundary points"). We believe BPJ is the first fully automated attack algorithm that succeeds in developing universal jailbreaks against Constitutional Classifiers, as well as the first automated attack algorithm that succeeds against GPT-5's input classifier without relying on human attack seeds. BPJ is difficult to defend against in individual interactions but incurs many flags during optimisation, suggesting that effective defence requires supplementing single-interaction methods with batch-level monitoring.
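The abstract's closing point about batch-level monitoring can be illustrated from the defender's side. The following is a minimal sketch, not the authors' method or any vendor's system: the class name `FlagRateMonitor`, the sliding-window design, and the threshold values are all assumptions chosen for illustration. It counts classifier flags per account over a time window and signals escalation once an account accumulates many flags, which is the kind of pattern an optimisation run that "incurs many flags" would tend to produce.

```python
from collections import defaultdict, deque


class FlagRateMonitor:
    """Hypothetical batch-level monitor: counts classifier flags per account
    inside a sliding time window. Names and thresholds are illustrative."""

    def __init__(self, window_seconds=3600.0, max_flags=20):
        self.window_seconds = window_seconds
        self.max_flags = max_flags
        # account_id -> timestamps of that account's flagged interactions
        self._flags = defaultdict(deque)

    def record(self, account_id, flagged, timestamp):
        """Record one interaction; return True if the account should be escalated."""
        history = self._flags[account_id]
        if flagged:
            history.append(timestamp)
        # Discard flags that have aged out of the window.
        cutoff = timestamp - self.window_seconds
        while history and history[0] <= cutoff:
            history.popleft()
        return len(history) >= self.max_flags


# Usage: three flags from one account within 60 seconds trigger escalation.
monitor = FlagRateMonitor(window_seconds=60.0, max_flags=3)
events = [
    ("acct-a", True, 0.0),
    ("acct-a", True, 10.0),
    ("acct-b", True, 12.0),
    ("acct-a", False, 15.0),
    ("acct-a", True, 20.0),
]
alerts = [monitor.record(account, flagged, ts) for account, flagged, ts in events]
```

A per-interaction classifier sees each of these queries in isolation, whereas the monitor aggregates across them. A real deployment would presumably also need to handle attackers who spread queries across accounts, which this sketch does not attempt.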

Key Contributions

BPJ: a fully automated black-box jailbreak algorithm, believed by the authors to be the first to succeed against Constitutional Classifiers (Anthropic) and GPT-5's input classifier, using only a single bit of feedback (flagged/not flagged) per query
Curriculum learning strategy (noise interpolation) that converts a harmful target string into a sequence of intermediate attack targets of increasing difficulty
Boundary point selection mechanism that identifies the evaluation points most sensitive to small changes in attack strength, enabling effective optimisation under binary feedback

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm

Threat Tags

black_box, inference_time, targeted

Datasets

Biological misuse question sets (custom), Constitutional Classifier evaluation suite

Applications

llm safety classifiers, constitutional classifiers, harmful content extraction prevention

Similar Papers

When Harmless Words Harm: A New Threat to LLM Safety via Conceptual Triggers

ArtPerception: ASCII Art-based Jailbreak on LLMs with Recognition Pre-test

Malicious Repurposing of Open Science Artefacts by Using Large Language Models

The Compliance Paradox: Semantic-Instruction Decoupling in Automated Academic Code Evaluation

Anecdoctoring: Automated Red-Teaming Across Language and Place

Learning to Inject: Automated Prompt Injection via Reinforcement Learning

Uncovering the Vulnerability of Large Language Models in the Financial Domain via Risk Concealment

TrailBlazer: History-Guided Reinforcement Learning for Black-Box LLM Jailbreaking