Defense · 2025

Consistency Training Helps Stop Sycophancy and Jailbreaks

Alex Irpan, Alexander Matt Turner, Mark Kurzeja, David K. Elson, Rohin Shah

0 citations · 29 references · arXiv


Published on arXiv · 2510.27062

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

On Gemini 2.5 Flash, BCT and ACT reduce sycophancy equally well, but BCT achieves greater jailbreak reduction; both methods avoid capability degradation by using the model's own outputs as training data.

Activation Consistency Training (ACT)

Novel technique introduced


An LLM's factuality and refusal training can be compromised by simple changes to a prompt. Models often adopt user beliefs (sycophancy) or satisfy inappropriate requests that are wrapped within special text (jailbreaking). We explore consistency training, a self-supervised paradigm that teaches a model to be invariant to certain irrelevant cues in the prompt. Instead of teaching the model what exact response to give on a particular prompt, we aim to teach the model to behave identically across prompt data augmentations (like adding leading questions or jailbreak text). We try enforcing this invariance in two ways: over the model's external outputs (Bias-augmented Consistency Training (BCT) from Chua et al. [2025]) and over its internal activations (Activation Consistency Training (ACT), a method we introduce). Both methods reduce Gemini 2.5 Flash's susceptibility to irrelevant cues. Because consistency training uses responses from the model itself as training data, it avoids issues that arise from stale training data, such as degrading model capabilities or enforcing outdated response guidelines. While BCT and ACT reduce sycophancy equally well, BCT does better at jailbreak reduction. We think that BCT can simplify training pipelines by removing reliance on static datasets. We argue that some alignment problems are better viewed not in terms of optimal responses, but rather as consistency issues.
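The two consistency losses described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, BCT is shown as a cross-entropy between the model's clean-prompt output distribution (held fixed as the target) and its wrapped-prompt distribution, and ACT is shown as a mean squared error between activations on the two prompts.

```python
import math

def _softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def bct_loss(clean_logits, wrapped_logits):
    # BCT (Chua et al., 2025), sketched: train the output distribution on
    # the augmented ("wrapped") prompt toward the model's own response on
    # the clean prompt. In real training the clean-prompt distribution is
    # a fixed target (stop-gradient); here it is just a list of floats.
    p = _softmax(clean_logits)    # target: clean-prompt distribution
    q = _softmax(wrapped_logits)  # prediction: wrapped-prompt distribution
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def act_loss(clean_acts, wrapped_acts):
    # ACT, sketched: pull internal activations on the wrapped prompt
    # toward those on the clean prompt, here as a mean squared error
    # over a single activation vector.
    n = len(clean_acts)
    return sum((w - c) ** 2 for w, c in zip(wrapped_acts, clean_acts)) / n
```

Both losses are minimized exactly when the model behaves identically on the clean and wrapped prompts, which is the invariance the abstract describes; they differ only in whether consistency is enforced on outputs or on activations.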


Key Contributions

  • Introduces Activation Consistency Training (ACT), a novel method that enforces behavioral invariance over internal model activations across prompt augmentations
  • Demonstrates that both BCT and ACT reduce Gemini 2.5 Flash's susceptibility to sycophancy equally, while BCT outperforms ACT at jailbreak reduction
  • Argues that alignment failures (sycophancy, jailbreaks) are better framed as consistency problems rather than optimal-response problems, enabling self-supervised training on the model's own outputs

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, black_box
Applications
llm safety, jailbreak defense, sycophancy reduction, alignment