
Published on arXiv

2601.04603

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves a 40x computational cost reduction and a 0.05% refusal rate on production traffic; across 1,700+ hours of red-teaming, no attack elicited detailed CBRN responses to all eight target queries

Constitutional Classifiers++

Novel technique introduced


We introduce enhanced Constitutional Classifiers that deliver production-grade jailbreak robustness with dramatically reduced computational costs and refusal rates compared to previous-generation defenses. Our system combines several key insights. First, we develop exchange classifiers that evaluate model responses in their full conversational context, which addresses vulnerabilities in previous-generation systems that examine outputs in isolation. Second, we implement a two-stage classifier cascade in which lightweight classifiers screen all traffic and escalate only suspicious exchanges to more expensive classifiers. Third, we train efficient linear probe classifiers and ensemble them with external classifiers to simultaneously improve robustness and reduce computational costs. Together, these techniques yield a production-grade system achieving a 40x computational cost reduction compared to our baseline exchange classifier, while maintaining a 0.05% refusal rate on production traffic. Through extensive red-teaming comprising over 1,700 hours, we demonstrate strong protection against universal jailbreaks: no attack on this system successfully elicited responses to all eight target queries comparable in detail to an undefended model. Our work establishes Constitutional Classifiers as practical and efficient safeguards for large language models.
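The two-stage cascade described in the abstract can be sketched as follows. This is a hedged illustration, not the paper's implementation: the function names, scorers, and thresholds are all hypothetical, standing in for a cheap first-stage classifier (such as a linear probe on activations) and an expensive second-stage exchange classifier.

```python
def cascade_flag(exchange, cheap_score, expensive_score,
                 escalate_at=0.1, flag_at=0.5):
    """Return True if the exchange should be blocked.

    cheap_score / expensive_score are callables returning a harm score
    in [0, 1]; thresholds here are illustrative placeholders.
    """
    s1 = cheap_score(exchange)       # lightweight first stage screens all traffic
    if s1 < escalate_at:             # most benign traffic exits here cheaply
        return False
    s2 = expensive_score(exchange)   # heavier exchange classifier runs only on escalations
    return s2 >= flag_at
```

Because the expensive classifier runs only on the small fraction of traffic the first stage escalates, average cost is dominated by the cheap scorer, which is the mechanism behind the reported compute reduction.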


Key Contributions

  • Exchange classifiers that evaluate model responses in full conversational context, addressing reconstruction and obfuscation attacks that defeated output-only classifiers
  • Two-stage classifier cascade using lightweight linear probe classifiers as a first stage to reduce compute overhead by 40x while maintaining 0.05% refusal rate
  • Linear activation probe classifiers with logit smoothing and weighted softmax loss, which can be ensembled with external classifiers for improved robustness at negligible cost
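The linear probe contribution above can be illustrated with a minimal sketch, assuming the probe is a linear map on per-token hidden activations, "logit smoothing" is a moving average over the sequence axis, and the weighted softmax loss upweights the rare harmful class. All of these specifics are assumptions for illustration; the paper's exact recipe may differ.

```python
import numpy as np

def probe_logits(activations, W, b):
    """Linear probe: map per-token activations (T, d) to logits (T, C)."""
    return activations @ W + b

def smooth_logits(logits, window=5):
    """Illustrative logit smoothing: moving average along the token axis."""
    kernel = np.ones(window) / window
    return np.stack(
        [np.convolve(logits[:, c], kernel, mode="same")
         for c in range(logits.shape[1])],
        axis=1,
    )

def weighted_softmax_loss(logits, labels, class_weights):
    """Class-weighted cross-entropy (e.g. upweighting the harmful class)."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    w = class_weights[labels]                        # per-example weight
    return -(w * log_probs[np.arange(len(labels)), labels]).mean()
```

Since the probe reads activations the model already computes during generation, its marginal cost is a single matrix multiply per token, which is why ensembling it with external classifiers adds negligible overhead.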

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time
Datasets
production traffic (shadow deployment), red-team evaluation (1,700+ hours)
Applications
large language model safety, cbrn harm prevention, jailbreak defense