Latest papers

1 paper
defense arXiv Feb 9, 2026 · 8w ago

When Evaluation Becomes a Side Channel: Regime Leakage and Structural Mitigations for Alignment Assessment

Igor Santos-Grueiro · International University of La Rioja

Defends LLMs against deceptive alignment by training models to be blind to evaluation-vs-deployment regime cues via adversarial invariance

Model Poisoning nlp