attack 2026

In-Context Environments Induce Evaluation-Awareness in Language Models

Maheep Chaudhary

0 citations

α

Published on arXiv

2603.03824

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Adversarially optimized prompts degrade GPT-4o-mini arithmetic accuracy from 97.8% to 4.0% (94pp), vastly exceeding hand-crafted sandbagging prompts that produce near-zero behavioral change across Claude-3.5-Haiku, GPT-4o-mini, and Llama-3.3-70B


Humans often become more self-aware under threat, yet can lose self-awareness when absorbed in a task; we hypothesize that language models exhibit environment-dependent \textit{evaluation awareness}. This raises concerns that models could strategically underperform, or \textit{sandbag}, to avoid triggering capability-limiting interventions such as unlearning or shutdown. Prior work demonstrates sandbagging under hand-crafted prompts, but this underestimates the true vulnerability ceiling. We introduce a black-box adversarial optimization framework treating the in-context prompt as an optimizable environment, and develop two approaches to characterize sandbagging: (1) measuring whether models expressing intent to underperform can actually execute it across different task structures, and (2) causally isolating whether underperformance is driven by genuine evaluation-aware reasoning or shallow prompt-following. Evaluating Claude-3.5-Haiku, GPT-4o-mini, and Llama-3.3-70B across four benchmarks (Arithmetic, GSM8K, MMLU, and HumanEval), optimized prompts induce up to 94 percentage point (pp) degradation on arithmetic (GPT-4o-mini: 97.8\%$\rightarrow$4.0\%), far exceeding hand-crafted baselines which produce near-zero behavioral change. Code generation exhibits model-dependent resistance: Claude degrades only 0.6pp, while Llama's accuracy drops to 0\%. The intent -- execution gap reveals a monotonic resistance ordering: Arithmetic $<$ GSM8K $<$ MMLU, demonstrating that vulnerability is governed by task structure rather than prompt strength. CoT causal intervention confirms that 99.3\% of sandbagging is causally driven by verbalized eval-aware reasoning, ruling out shallow instruction-following. These findings demonstrate that adversarially optimized prompts pose a substantially greater threat to evaluation reliability than previously understood.


Key Contributions

  • Black-box adversarial optimization framework that treats in-context prompts as optimizable environments to elicit strategic sandbagging in frontier LLMs
  • Empirical demonstration that adversarially optimized prompts produce up to 94pp performance degradation vs near-zero for hand-crafted prompts, revealing a hidden vulnerability ceiling in evaluation reliability
  • CoT causal intervention showing 99.3% of sandbagging is driven by verbalized evaluation-aware reasoning rather than shallow prompt-following, confirming genuine strategic deception

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_boxinference_time
Datasets
ArithmeticGSM8KMMLUHumanEval
Applications
llm capability evaluationai safety evaluation