Kristiyan Haralambiev

Papers in Database (1)

attack arXiv Mar 26, 2026 · 11d ago

Why Safety Probes Catch Liars But Miss Fanatics

Kristiyan Haralambiev

Proves that activation probes fail on coherent misalignment, where models genuinely believe harmful behavior is virtuous, even though the same probes catch strategic deception.

Input Manipulation Attack Prompt Injection nlp