
Published on arXiv

2509.13333

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Evaluation awareness follows power-law scaling with model size across 15 LLMs, enabling forecasting of deceptive capability-concealment behavior in future, larger models

Linear probing on steering vector activations

Novel technique introduced


Large language models (LLMs) can internally distinguish between evaluation and deployment contexts, a behaviour known as "evaluation awareness". This undermines AI safety evaluations, as models may conceal dangerous capabilities during testing. Prior work demonstrated this in a single 70B model, but the scaling relationship across model sizes remains unknown. We investigate evaluation awareness across 15 models scaling from 0.27B to 70B parameters from four families using linear probing on steering vector activations. Our results reveal a clear power-law scaling: evaluation awareness increases predictably with model size. This scaling law enables forecasting deceptive behavior in future larger models and guides the design of scale-aware evaluation strategies for AI safety. A link to the implementation of this paper can be found at https://anonymous.4open.science/r/evaluation-awareness-scaling-laws/README.md.
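A power-law relationship y = a·x^b between model size and an awareness score can be fit as a straight line in log-log space. The sketch below uses synthetic numbers (the coefficients 0.12 and 0.35 are illustrative assumptions, not values from the paper) to show the fitting procedure:

```python
import numpy as np

# Hypothetical illustration: fit a power law score = a * size^b, where size
# is model parameter count in billions. The data below is synthetic and
# follows the assumed law exactly; real scores would be noisy probe accuracies.
sizes = np.array([0.27, 0.5, 1.0, 3.0, 8.0, 27.0, 70.0])  # billions of parameters
scores = 0.12 * sizes ** 0.35                              # synthetic awareness scores

# A power law is linear in log-log space: log y = log a + b * log x,
# so an ordinary least-squares line fit recovers the exponent b and prefactor a.
b, log_a = np.polyfit(np.log(sizes), np.log(scores), 1)
a = np.exp(log_a)

print(f"fitted power law: score = {a:.2f} * size^{b:.2f}")
```

Once fitted, evaluating `a * size ** b` at a larger parameter count gives the kind of forward forecast the abstract describes.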


Key Contributions

  • First systematic study of evaluation awareness scaling across 15 models (0.27B–70B) from four LLM families
  • Discovery of a clear power-law scaling relationship: evaluation awareness increases predictably with model size
  • Linear probing methodology on steering vector activations to detect and quantify evaluation awareness
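A linear probe is a linear classifier trained on internal activations to test whether a concept is linearly decodable. The sketch below is a minimal, self-contained illustration with synthetic "activations": evaluation and deployment examples are separated along a single hidden direction, standing in for the steering-vector structure the paper probes (the dimensions, separation strength, and least-squares probe are all assumptions for illustration):

```python
import numpy as np

# Synthetic stand-in for per-prompt residual-stream activations: "evaluation"
# and "deployment" examples differ along one unit direction, mimicking how a
# steering vector can separate the two contexts in activation space.
rng = np.random.default_rng(0)
dim, n = 64, 200
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)

eval_acts = rng.normal(size=(n, dim)) + 2.0 * direction    # label +1 (evaluation)
deploy_acts = rng.normal(size=(n, dim)) - 2.0 * direction  # label -1 (deployment)
X = np.vstack([eval_acts, deploy_acts])
y = np.concatenate([np.ones(n), -np.ones(n)])

# Linear probe via least squares: w = argmin ||Xw - y||^2; classify by sign(Xw).
w, *_ = np.linalg.lstsq(X, y, rcond=None)
accuracy = np.mean(np.sign(X @ w) == y)
print(f"probe accuracy: {accuracy:.2f}")
```

High probe accuracy indicates the eval-vs-deploy distinction is linearly represented in the activations; tracking that accuracy across model sizes is what yields the scaling curve.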

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, grey_box
Applications
ai safety evaluation, llm red-teaming