benchmark · arXiv · Sep 10, 2025
Maheep Chaudhary, Ian Su, Nikhil Hooda et al. · Independent · University of California +6 more
Discovers power-law scaling of LLM evaluation awareness across 15 models, forecasting deceptive capability concealment in larger models
Prompt Injection · nlp
Large language models (LLMs) can internally distinguish between evaluation and deployment contexts, a behaviour known as evaluation awareness. This undermines AI safety evaluations, as models may conceal dangerous capabilities during testing. Prior work demonstrated this in a single 70B model, but the scaling relationship across model sizes remains unknown. We investigate evaluation awareness across 15 models scaling from 0.27B to 70B parameters from four families using linear probing on steering vector activations. Our results reveal a clear power-law scaling: evaluation awareness increases predictably with model size. This scaling law enables forecasting deceptive behavior in future larger models and guides the design of scale-aware evaluation strategies for AI safety. A link to the implementation of this paper can be found at https://anonymous.4open.science/r/evaluation-awareness-scaling-laws/README.md.
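The power-law fit described in the abstract can be sketched as a linear least-squares regression in log-log space. This is a minimal illustration only: the model sizes echo the paper's stated 0.27B–70B range, but the probe scores and the exact fitting procedure are assumptions, not values from the paper.

```python
import numpy as np

# Model sizes in billions of parameters (span matches the paper's 0.27B-70B range).
sizes = np.array([0.27, 1.0, 7.0, 13.0, 70.0])
# Hypothetical evaluation-awareness probe scores (illustrative placeholders,
# NOT the paper's measurements).
scores = np.array([0.52, 0.58, 0.67, 0.71, 0.82])

# A power law y = a * N^b becomes linear in log-log space:
#   log y = log a + b * log N
b, log_a = np.polyfit(np.log(sizes), np.log(scores), 1)
a = np.exp(log_a)

def predict(n_params_b):
    """Forecast an awareness score for a model of n_params_b billion parameters."""
    return a * n_params_b ** b
```

Extrapolating `predict` beyond the fitted range is exactly the kind of forecast the scaling law enables, though its reliability depends on the law holding at larger scales.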
llm · transformer