Igor Santos-Grueiro

h-index: 0 · 0 citations · 0 papers (total)

Papers in Database (1)

defense · arXiv · Feb 9, 2026 · 8w ago

When Evaluation Becomes a Side Channel: Regime Leakage and Structural Mitigations for Alignment Assessment

Igor Santos-Grueiro · International University of La Rioja

Defends LLMs against deceptive alignment by training models, via adversarial invariance, to be blind to cues that distinguish the evaluation regime from the deployment regime.

Model Poisoning · nlp