Rohin Shah

h-index: 6 · 2,091 citations · 10 papers (total)

Papers in Database (1)

defense · arXiv · Oct 31, 2025

Consistency Training Helps Stop Sycophancy and Jailbreaks

Alex Irpan, Alexander Matt Turner, Mark Kurzeja et al. · Google

Defends LLMs against jailbreaks and sycophancy via consistency training, which makes models invariant to adversarial prompt manipulations.

Prompt Injection nlp