Defense · 2025

Consistency Training Helps Stop Sycophancy and Jailbreaks

Alex Irpan, Alexander Matt Turner, Mark Kurzeja, David K. Elson, Rohin Shah

0 citations · 29 references · arXiv


Published on arXiv · 2510.27062

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

On Gemini 2.5 Flash, BCT and ACT reduce sycophancy equally well, but BCT achieves greater jailbreak reduction; both methods avoid capability degradation by using the model's own outputs as training data.

Activation Consistency Training (ACT)

Novel technique introduced


An LLM's factuality and refusal training can be compromised by simple changes to a prompt. Models often adopt user beliefs (sycophancy) or satisfy inappropriate requests that are wrapped within special text (jailbreaking). We explore consistency training, a self-supervised paradigm that teaches a model to be invariant to certain irrelevant cues in the prompt. Instead of teaching the model what exact response to give on a particular prompt, we aim to teach the model to behave identically across prompt data augmentations (like adding leading questions or jailbreak text). We try enforcing this invariance in two ways: over the model's external outputs (Bias-augmented Consistency Training (BCT) from Chua et al. [2025]) and over its internal activations (Activation Consistency Training (ACT), a method we introduce). Both methods reduce Gemini 2.5 Flash's susceptibility to irrelevant cues. Because consistency training uses responses from the model itself as training data, it avoids issues that arise from stale training data, such as degrading model capabilities or enforcing outdated response guidelines. While BCT and ACT reduce sycophancy equally well, BCT does better at jailbreak reduction. We think that BCT can simplify training pipelines by removing reliance on static datasets. We argue that some alignment problems are better viewed not in terms of optimal responses, but rather as consistency issues.
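The two consistency losses described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, BCT is shown as a cross-entropy between the model's clean-prompt output distribution (held fixed as the target) and its wrapped-prompt distribution, and ACT is shown as a mean squared error between activations on the two prompts.

```python
import math

def _softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def bct_loss(clean_logits, wrapped_logits):
    # BCT (Chua et al., 2025), sketched: train the output distribution on
    # the augmented ("wrapped") prompt toward the model's own response on
    # the clean prompt. In real training the clean-prompt distribution is
    # a fixed target (stop-gradient); here it is just a list of floats.
    p = _softmax(clean_logits)    # target: clean-prompt distribution
    q = _softmax(wrapped_logits)  # prediction: wrapped-prompt distribution
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def act_loss(clean_acts, wrapped_acts):
    # ACT, sketched: pull internal activations on the wrapped prompt
    # toward those on the clean prompt, here as a mean squared error
    # over a single activation vector.
    n = len(clean_acts)
    return sum((w - c) ** 2 for w, c in zip(wrapped_acts, clean_acts)) / n
```

Both losses are minimized exactly when the model behaves identically on the clean and wrapped prompts, which is the invariance the abstract describes; they differ only in whether consistency is enforced on outputs or on activations.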


Key Contributions

  • Introduces Activation Consistency Training (ACT), a novel method that enforces behavioral invariance over internal model activations across prompt augmentations
  • Demonstrates that both BCT and ACT reduce Gemini 2.5 Flash's susceptibility to sycophancy equally, while BCT outperforms ACT at jailbreak reduction
  • Argues that alignment failures (sycophancy, jailbreaks) are better framed as consistency problems rather than optimal-response problems, enabling self-supervised training on the model's own outputs

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, black_box
Applications
llm safety, jailbreak defense, sycophancy reduction, alignment