Latest papers

1 paper
attack · arXiv · Apr 3, 2026

Generalization Limits of Reinforcement Learning Alignment

Haruhi Shida, Koo Imai, Keigo Kansa · Aladdin Security Inc.

A compound jailbreak attack combining multiple techniques to exploit generalization gaps in RLHF safety training, achieving a 71.4% success rate.

Prompt Injection · nlp