Alwin Peng

Papers in Database (1)

attack arXiv Mar 30, 2026 · 9d ago

Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning

Bilgehan Sel, Xuanli He, Alwin Peng et al. · Anthropic · Virginia Tech +1 more

Adversarial fine-tuning attack that bypasses Constitutional Classifiers via curriculum learning, achieving 99% evasion with minimal capability loss

Prompt Injection Training Data Poisoning nlp
PDF