Alex Irpan

h-index: 21 · 15,007 citations · 32 papers (total)

Papers in Database (1)

defense · arXiv · Oct 31, 2025

Consistency Training Helps Stop Sycophancy and Jailbreaks

Alex Irpan, Alexander Matt Turner, Mark Kurzeja et al. · Google

Defends LLMs against jailbreaks and sycophancy via consistency training, making models invariant to adversarial prompt manipulations

Prompt Injection · nlp