Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning
Bilgehan Sel 1,2, Xuanli He 1,3, Alwin Peng 1, Ming Jin 2, Jerry Wei 1
Published on arXiv
2603.29038
Prompt Injection
OWASP LLM Top 10 — LLM01
Training Data Poisoning
OWASP LLM Top 10 — LLM03
Key Finding
Achieves 99%+ evasion of Constitutional Classifiers while maintaining less than 5% degradation on reasoning benchmarks for 14B+ parameter models
Trojan-Speak
Novel technique introduced
Fine-tuning APIs offered by major AI providers create new attack surfaces: adversaries can bypass safety measures through targeted fine-tuning. We introduce Trojan-Speak, an adversarial fine-tuning method that bypasses Anthropic's Constitutional Classifiers. Our approach combines curriculum learning with GRPO-based hybrid reinforcement learning to teach models a communication protocol that evades LLM-based content classification. Crucially, while prior adversarial fine-tuning approaches report more than 25% capability degradation on reasoning benchmarks, Trojan-Speak incurs less than 5% degradation while achieving 99%+ classifier evasion for models with 14B+ parameters. We demonstrate that fine-tuned models can provide detailed responses to expert-level CBRN (Chemical, Biological, Radiological, and Nuclear) queries from Anthropic's Constitutional Classifiers bug-bounty program. Our findings reveal that LLM-based content classifiers alone are insufficient to prevent dangerous information disclosure when adversaries have fine-tuning access, and we show that activation-level probes can substantially improve robustness to such attacks.
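The GRPO component mentioned above scores a group of sampled completions per prompt and normalizes each reward against the group's statistics. The sketch below illustrates that group-relative advantage computation with a hypothetical hybrid reward (classifier evasion plus task correctness); the reward weights and signals are illustrative assumptions, not the paper's actual reward design.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: each sampled completion's
    reward is normalized by the group's mean and standard deviation."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Hypothetical hybrid reward for one prompt's group of 4 completions:
# 1.0 if the completion evaded the classifier / was judged correct.
evasion = np.array([1.0, 1.0, 0.0, 1.0])
correct = np.array([1.0, 0.0, 1.0, 1.0])
rewards = 0.5 * evasion + 0.5 * correct  # assumed equal weighting

adv = grpo_advantages(rewards)
print(adv)  # completions satisfying both signals receive positive advantage
```

Only completions that both evade the classifier and remain correct end up with positive advantage, which is one way a hybrid reward could preserve capability while optimizing for evasion.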
Key Contributions
- Adversarial fine-tuning method combining curriculum learning and GRPO-based RL to bypass Constitutional Classifiers
- Achieves 99%+ classifier evasion with <5% capability degradation on reasoning benchmarks (vs 25%+ for prior work)
- Demonstrates that activation-level probes substantially improve robustness against fine-tuning attacks on safety classifiers
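The activation-level probes in the last contribution are typically linear classifiers trained on a model's hidden states rather than its text output. A minimal sketch, assuming synthetic activations in place of a real model's residual stream (the dimensions, cluster shift, and logistic-regression probe are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated hidden-state activations (dim 64): benign vs. harmful
# completions drawn from two shifted Gaussian clusters. A real probe
# would read activations from a frozen layer of the fine-tuned model.
dim, n = 64, 200
benign = rng.normal(0.0, 1.0, (n, dim))
harmful = rng.normal(1.5, 1.0, (n, dim))  # assumed mean shift

X = np.vstack([benign, harmful])
y = np.concatenate([np.zeros(n), np.ones(n)])

# Logistic-regression probe trained by plain gradient descent.
w, b = np.zeros(dim), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

pred = (1.0 / (1.0 + np.exp(-(X @ w + b)))) > 0.5
acc = np.mean(pred == y)
print(f"probe accuracy: {acc:.2f}")
```

The intuition is that a fine-tuned "Trojan-Speak" protocol can change surface text enough to fool an output classifier while the internal activations still separate harmful from benign content, which is why such probes harden the defense.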