Nils Lukas

Papers in Database (1)

defense arXiv Mar 24, 2026

Robust Safety Monitoring of Language Models via Activation Watermarking

Toluwani Aremu, Daniil Ognev, Samuele Poppi et al. · Mohamed bin Zayed University of Artificial Intelligence

Activation watermarking defense that detects adaptive jailbreak attacks on LLM safety monitors, with a 52% improvement over baselines

Prompt Injection nlp