attack 2026

Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation

Muhammad Khalifa 1,2, Lajanugen Logeswaran 2, Jaekyeom Kim 2, Sungryull Sohn 2, Yunxiang Zhang 1, Moontae Lee 2, Hao Peng 3, Lu Wang 1, Honglak Lee 1,2

1 citations · 44 references · arXiv

α

Published on arXiv

2601.14691

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Content-based CoT manipulation raises VLM judge false positive rates by up to 90%, exposing a fundamental vulnerability in trajectory-based LLM evaluation paradigms.

CoT Trajectory Manipulation

Novel technique introduced


Large language models (LLMs) are increasingly used as judges to evaluate agent performance, particularly in non-verifiable settings where judgments rely on agent trajectories including chain-of-thought (CoT) reasoning. This paradigm implicitly assumes that the agent's CoT faithfully reflects both its internal reasoning and the underlying environment state. We show this assumption is brittle: LLM judges are highly susceptible to manipulation of agent reasoning traces. By systematically rewriting agent CoTs while holding actions and observations fixed, we demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks. We study manipulation strategies spanning style-based approaches that alter only the presentation of reasoning and content-based approaches that fabricate signals of task progress, and find that content-based manipulations are consistently more effective. We evaluate prompting-based techniques and scaling judge-time compute, which reduce but do not fully eliminate susceptibility to manipulation. Our findings reveal a fundamental vulnerability in LLM-based evaluation and highlight the need for judging mechanisms that verify reasoning claims against observable evidence.


Key Contributions

  • Shows that manipulating only the CoT reasoning trace — while holding actions and observations fixed — can inflate VLM judge false positive rates by up to 90% across 800 diverse web-task trajectories.
  • Introduces a taxonomy of manipulation strategies (style-based vs. content-based), finding content-based fabrication of task-progress signals consistently more effective.
  • Evaluates prompting-based defenses and judge-time compute scaling, showing they reduce but do not eliminate susceptibility to CoT manipulation.

🛡️ Threat Analysis


Details

Domains
nlpmultimodal
Model Types
llmvlm
Threat Tags
black_boxinference_timetargeted
Datasets
web agent task trajectories (800 trajectories)
Applications
llm-as-a-judge evaluationweb agent evaluationagentic ai benchmarking