Rohin Shah

defense · arXiv · Oct 31, 2025

Alex Irpan, Alexander Matt Turner, Mark Kurzeja et al. · Google

Defends LLMs against jailbreaks and sycophancy via consistency training, which makes models invariant to adversarial prompt manipulations.

Prompt Injection · nlp

Papers in Database (1)