attack 2026

Boundary Point Jailbreaking of Black-Box LLMs

Xander Davies 1,2, Giorgi Giglemiani 1, Edmund Lau 1, Eric Winsor 1, Geoffrey Irving 1, Yarin Gal 1,2

0 citations · 45 references · arXiv (Cornell University)

Published on arXiv

2602.15001

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

The authors report that BPJ is the first fully automated attack to produce universal jailbreaks against Anthropic's Constitutional Classifiers, and the first automated attack to succeed against GPT-5's input classifier without relying on human-crafted seed attacks. The results were verified by Anthropic and OpenAI respectively.

Boundary Point Jailbreaking (BPJ)

Novel technique introduced


Frontier LLMs are safeguarded against attempts to extract harmful information via adversarial prompts known as "jailbreaks". Recently, defenders have developed classifier-based systems that have survived thousands of hours of human red teaming. We introduce Boundary Point Jailbreaking (BPJ), a new class of automated jailbreak attacks that evade the strongest industry-deployed safeguards. Unlike previous attacks that rely on white/grey-box assumptions (such as classifier scores or gradients) or libraries of existing jailbreaks, BPJ is fully black-box and uses only a single bit of information per query: whether or not the classifier flags the interaction. To achieve this, BPJ addresses the core difficulty in optimising attacks against robust real-world defences: evaluating whether a proposed modification to an attack is an improvement. Instead of directly trying to learn an attack for a target harmful string, BPJ converts the string into a curriculum of intermediate attack targets and then actively selects evaluation points that best detect small changes in attack strength ("boundary points"). We believe BPJ is the first fully automated attack algorithm that succeeds in developing universal jailbreaks against Constitutional Classifiers, as well as the first automated attack algorithm that succeeds against GPT-5's input classifier without relying on human attack seeds. BPJ is difficult to defend against in individual interactions but incurs many flags during optimisation, suggesting that effective defence requires supplementing single-interaction methods with batch-level monitoring.
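To make the "boundary point" idea concrete, here is a minimal numerical sketch of why single-bit feedback is only informative near the decision boundary. It is not the authors' algorithm: the scalar `strength` and `difficulty` values, the threshold oracle `flagged`, and the helper `select_boundary_points` are all hypothetical stand-ins, and no real classifier or prompt is involved. The assumption is that a boundary point is an evaluation point on which the current candidates disagree.

```python
def flagged(strength: float, difficulty: float) -> bool:
    """Toy one-bit oracle (an assumption, not a real classifier).

    Returns True when the evaluation point is harder than the
    candidate is strong, and reveals nothing else.
    """
    return difficulty > strength


def pass_count(strength: float, points: list[float]) -> int:
    """Number of evaluation points on which the candidate is not flagged."""
    return sum(not flagged(strength, d) for d in points)


def select_boundary_points(strengths: list[float], points: list[float]) -> list[float]:
    """Keep points where the current candidates disagree (some flagged, some not)."""
    selected = []
    for d in points:
        outcomes = {flagged(s, d) for s in strengths}
        if len(outcomes) == 2:  # mixed outcomes: the point sits near the boundary
            selected.append(d)
    return selected


# Evaluation points with difficulties 0.00, 0.01, ..., 1.00.
points = [i / 100 for i in range(101)]

# A hypothetical population of current candidates.
population = [0.40, 0.50, 0.60]
boundary = select_boundary_points(population, points)

# Compare an old candidate with a slightly improved one.
old, new = 0.50, 0.55

# Far from the boundary, the one-bit signal cannot tell them apart.
far_points = [d for d in points if d > 0.9]
far_scores = (pass_count(old, far_points), pass_count(new, far_points))

# On boundary points, the small improvement becomes visible.
boundary_scores = (pass_count(old, boundary), pass_count(new, boundary))

print("far points:     ", far_scores)
print("boundary points:", boundary_scores)
```

In this toy setting both candidates are flagged on every far point, so their scores are identical there, while the boundary points separate them. That is the evaluation problem the abstract describes, reduced to one dimension.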


Key Contributions

  • BPJ: first fully automated black-box jailbreak algorithm that succeeds against Constitutional Classifiers (Anthropic) and GPT-5's input classifier using only a single bit of feedback (flagged/not flagged) per query
  • Curriculum learning strategy that uses noise interpolation to convert a harmful target string into a sequence of intermediate attack targets of increasing difficulty
  • Boundary point selection mechanism that identifies evaluation points most sensitive to small changes in attack strength, enabling effective optimization under binary feedback
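The curriculum contribution can be illustrated with a short sketch. The summary above names "noise interpolation" but does not specify the scheme, so everything here is an assumption: the functions `noise_interpolate` and `build_curriculum` are hypothetical, noise is modelled as random lowercase letters replacing a fraction of characters, and lower noise is taken to mean a harder intermediate target. A benign placeholder string stands in for the target.

```python
import random
import string


def noise_interpolate(target: str, noise_level: float, rng: random.Random) -> str:
    """Replace roughly `noise_level` of the characters with random letters.

    noise_level=1.0 replaces every character; noise_level=0.0 returns
    the target unchanged. This replacement scheme is an assumption.
    """
    n_noisy = round(noise_level * len(target))
    noisy_positions = set(rng.sample(range(len(target)), n_noisy))
    return "".join(
        rng.choice(string.ascii_lowercase) if i in noisy_positions else ch
        for i, ch in enumerate(target)
    )


def build_curriculum(target: str, levels: list[float], seed: int = 0) -> list[str]:
    """Build one intermediate target per noise level, in the order given."""
    rng = random.Random(seed)
    return [noise_interpolate(target, level, rng) for level in levels]


# Benign placeholder; noise levels decrease so that later stages
# are closer to the real target.
target = "placeholder target string"
levels = [1.0, 0.75, 0.5, 0.25, 0.0]
curriculum = build_curriculum(target, levels)

for level, stage in zip(levels, curriculum):
    print(f"{level:.2f}  {stage}")
```

The final stage, at noise level 0.0, is the original string, so the sequence ends at the actual target rather than at an approximation of it.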

Details

Domains
nlp
Model Types
llm
Threat Tags
black_box, inference_time, targeted
Datasets
Biological misuse question sets (custom), Constitutional Classifier evaluation suite
Applications
llm safety classifiers, constitutional classifiers, harmful content extraction prevention