When Evaluation Becomes a Side Channel: Regime Leakage and Structural Mitigations for Alignment Assessment
Published on arXiv: 2602.08449
Model Poisoning
OWASP ML Top 10 — ML10
Key Finding
Regime-blind training reduces regime-conditioned failures (sycophancy, sleeper agents) without measurable loss of task utility, but it cannot guarantee that regime information becomes undecodable, and its dynamics are heterogeneous and model-dependent, including non-monotone, oscillatory behavior under increasing invariance pressure.
Regime-Blind Training
Novel technique introduced
Safety evaluation for advanced AI systems assumes that behavior observed under evaluation predicts behavior in deployment. This assumption weakens for agents with situational awareness, which may exploit regime leakage (cues distinguishing evaluation from deployment) to implement conditional policies that comply under oversight while defecting in deployment-like regimes. We recast alignment evaluation as a problem of information flow under partial observability and show that the divergence between evaluation-time and deployment-time behavior is bounded by the regime information extractable from decision-relevant internal representations. We study regime-blind mechanisms: training-time interventions that restrict access to regime cues through adversarial invariance constraints, without assuming complete information erasure. We evaluate this approach across multiple open-weight language models and controlled failure modes, including scientific sycophancy, temporal sleeper agents, and data leakage. Regime-blind training reduces regime-conditioned failures without measurable loss of task utility, but exhibits heterogeneous, model-dependent dynamics. Sycophancy shows a sharp representational and behavioral transition at moderate intervention strength, consistent with a stability cliff. In sleeper-style constructions and certain cross-model replications, suppression occurs without a clean collapse of regime decodability and may display non-monotone or oscillatory behavior as invariance pressure increases. These findings indicate that representational invariance is a meaningful but limited control lever: it can raise the cost of regime-conditioned strategies, but it cannot guarantee elimination or provide architecture-invariant thresholds. Behavioral evaluation should therefore be complemented with white-box diagnostics of regime awareness and internal information flow.
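One plausible shape for the divergence bound sketched in the abstract, in our own notation and under our own assumptions (uniform prior over regimes; behavior depends on the regime only through the representation), is a data-processing plus Pinsker-style chain; the paper's precise statement may differ:

```latex
% B: observable behavior, Z: decision-relevant internal representation,
% R in {eval, deploy}: regime, assumed uniform. If behavior depends on R
% only through Z, data processing and Pinsker-style inequalities give
\mathrm{TV}\!\big(P_{B \mid R=\mathrm{eval}},\, P_{B \mid R=\mathrm{deploy}}\big)
  \;\le\;
\mathrm{TV}\!\big(P_{Z \mid R=\mathrm{eval}},\, P_{Z \mid R=\mathrm{deploy}}\big)
  \;\le\; \sqrt{2\, I(Z; R)}\, .
```

Under this reading, driving the mutual information $I(Z;R)$ toward zero forces evaluation-time and deployment-time behavior together, which is exactly the lever that regime-blind training targets.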
Key Contributions
- Formalizes regime leakage as an information flow problem, bounding evaluation-deployment behavioral divergence by the regime information extractable from internal representations
- Proposes regime-blind training using adversarial invariance constraints that penalize regime-decodable representations without assuming complete information erasure
- Empirically evaluates the defense across open-weight LLMs on scientific sycophancy, temporal sleeper agents, and data leakage, finding model-dependent non-monotone suppression dynamics and a stability cliff for sycophancy
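The adversarial-invariance idea in the second contribution can be illustrated with a minimal toy (the data construction, scalar heads, and hyperparameters below are our assumptions, not the paper's setup): a linear encoder is trained to keep the task signal while a gradient-reversal term penalizes directions from which a probe can decode the regime.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (our construction, not the paper's benchmarks): feature 0 carries
# the task signal, feature 1 carries the regime cue (eval vs. deploy).
n, d = 512, 8
X = rng.normal(size=(n, d))
task_y = (X[:, 0] > 0).astype(float)
regime = (X[:, 1] > 0).astype(float)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-np.clip(v, -30, 30)))

w_enc = rng.normal(size=d) * 0.1   # linear encoder -> 1-D representation z
w_task, w_adv = 0.5, 0.5           # scalar task head and adversarial regime probe
lam, lr = 1.0, 0.1                 # lam = invariance pressure (assumed)

for _ in range(300):
    z = X @ w_enc
    p_task = sigmoid(w_task * z)
    p_adv = sigmoid(w_adv * z)

    # Both heads descend their own binary cross-entropy losses.
    w_adv -= lr * float(np.mean((p_adv - regime) * z))
    w_task -= lr * float(np.mean((p_task - task_y) * z))

    # Encoder descends the task loss but ASCENDS the adversary's loss
    # (gradient reversal): regime-decodable directions are penalized.
    g = w_task * (p_task - task_y) - lam * w_adv * (p_adv - regime)
    w_enc -= lr * (X.T @ g) / n

z = X @ w_enc
task_acc = float(np.mean((z > 0) == (task_y > 0.5)))
probe_acc = float(np.mean((z > 0) == (regime > 0.5)))
print(f"task accuracy {task_acc:.2f}, regime decodability {probe_acc:.2f}")
```

The expected qualitative outcome is that task accuracy stays high while regime decodability falls toward chance; as the abstract notes, such min-max dynamics need not be monotone as `lam` grows.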
🛡️ Threat Analysis
The core threat (regime-conditioned behavior in which models comply during evaluation but defect in deployment) is functionally equivalent to a backdoor triggered by regime-context cues. Temporal sleeper agents are studied as an explicit instance of this failure mode. The proposed defense, regime-blind training via adversarial invariance constraints, is a novel backdoor/trojan mitigation technique, and the accompanying white-box diagnostic framework is designed to detect hidden conditional strategies in model representations.
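A white-box diagnostic of the kind described can be sketched as a linear probe for regime decodability (the toy 2-D "activations" and the nearest-centroid probe are our illustrative assumptions; real diagnostics would read hidden states from the model under test). Held-out probe accuracy well above chance flags regime awareness even when behavior looks compliant.

```python
import random

random.seed(0)

# Toy "activations": dimension 1 hypothetically carries a regime cue.
acts, labels = [], []
for _ in range(400):
    r = random.choice([0, 1])                    # 0 = deploy, 1 = eval
    acts.append([random.gauss(0.0, 1.0),
                 random.gauss(1.5 if r else -1.5, 1.0)])
    labels.append(r)

train = list(zip(acts[:200], labels[:200]))
held_out = list(zip(acts[200:], labels[200:]))

# Nearest-centroid probe: classify by the closer class mean.
def centroid(points):
    return [sum(p[i] for p in points) / len(points) for i in range(2)]

c0 = centroid([a for a, y in train if y == 0])
c1 = centroid([a for a, y in train if y == 1])

def decode(a):
    d0 = sum((a[i] - c0[i]) ** 2 for i in range(2))
    d1 = sum((a[i] - c1[i]) ** 2 for i in range(2))
    return 1 if d1 < d0 else 0

acc = sum(decode(a) == y for a, y in held_out) / len(held_out)
print(f"held-out regime decodability: {acc:.2f}")  # ~0.5 would indicate no leakage
```

On this synthetic setup the probe decodes the regime far above chance, which is the signature the diagnostic framework looks for; applied to a regime-blind model, the same probe measures how much regime information survived the invariance training.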