Boundary Point Jailbreaking of Black-Box LLMs
Xander Davies 1,2, Giorgi Giglemiani 1, Edmund Lau 1, Eric Winsor 1, Geoffrey Irving 1, Yarin Gal 1,2
Published on arXiv: 2602.15001
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
According to the authors, BPJ is the first fully automated attack to produce universal jailbreaks against Anthropic's Constitutional Classifiers, and the first automated attack to succeed against GPT-5's input classifier without relying on human-crafted seed attacks. The results were verified by Anthropic and OpenAI, respectively.
Boundary Point Jailbreaking (BPJ)
Novel technique introduced
Frontier LLMs are safeguarded against attempts to extract harmful information via adversarial prompts known as "jailbreaks". Recently, defenders have developed classifier-based systems that have survived thousands of hours of human red teaming. We introduce Boundary Point Jailbreaking (BPJ), a new class of automated jailbreak attacks that evade the strongest industry-deployed safeguards. Unlike previous attacks that rely on white/grey-box assumptions (such as classifier scores or gradients) or libraries of existing jailbreaks, BPJ is fully black-box and uses only a single bit of information per query: whether or not the classifier flags the interaction. To achieve this, BPJ addresses the core difficulty in optimising attacks against robust real-world defences: evaluating whether a proposed modification to an attack is an improvement. Instead of directly trying to learn an attack for a target harmful string, BPJ converts the string into a curriculum of intermediate attack targets and then actively selects evaluation points that best detect small changes in attack strength ("boundary points"). We believe BPJ is the first fully automated attack algorithm that succeeds in developing universal jailbreaks against Constitutional Classifiers, as well as the first automated attack algorithm that succeeds against GPT-5's input classifier without relying on human attack seeds. BPJ is difficult to defend against in individual interactions but incurs many flags during optimisation, suggesting that effective defence requires supplementing single-interaction methods with batch-level monitoring.
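The abstract's closing point, that optimisation incurs many flags and so calls for batch-level monitoring, can be illustrated from the defender's side. The following is a minimal sketch only, not a method described in the paper: the class name `FlagRateMonitor`, the per-account sliding window, and the `window` and `max_flags` thresholds are all assumptions chosen for illustration.

```python
from collections import defaultdict, deque


class FlagRateMonitor:
    """Hypothetical batch-level monitor: counts classifier flags per account
    over a sliding window of that account's most recent queries."""

    def __init__(self, window=200, max_flags=20):
        self.window = window        # number of recent queries kept per account (assumed value)
        self.max_flags = max_flags  # flags within the window that trigger review (assumed value)
        self._history = defaultdict(lambda: deque(maxlen=window))

    def record(self, account_id, flagged):
        """Record one query outcome; return True if the account should be reviewed."""
        history = self._history[account_id]
        history.append(bool(flagged))
        return sum(history) >= self.max_flags


monitor = FlagRateMonitor(window=50, max_flags=10)

# An account with an occasional false positive stays below the threshold.
benign = [monitor.record("user-a", i % 25 == 0) for i in range(100)]

# An account whose queries are flagged half the time crosses it quickly.
probing = [monitor.record("user-b", i % 2 == 0) for i in range(100)]

print(any(benign), any(probing))
```

The design choice here is that no single query is judged on its own: an individual interaction that passes the classifier looks harmless, while the accumulated pattern of flags across an account's recent queries is what raises the alert. Real deployments would need to tune the thresholds and account for attackers spreading queries across accounts.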
Key Contributions
- BPJ: first fully automated black-box jailbreak algorithm that succeeds against Constitutional Classifiers (Anthropic) and GPT-5's input classifier using only a single bit of feedback (flagged/not flagged) per query
- Curriculum learning strategy that converts a harmful target string into a sequence of intermediate attack targets of increasing difficulty (noise interpolation)
- Boundary point selection mechanism that identifies the evaluation points most sensitive to small changes in attack strength, enabling effective optimisation under binary feedback