attack 2026

Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation

Muhammad Khalifa ^1,2, Lajanugen Logeswaran ², Jaekyeom Kim ², Sungryull Sohn ², Yunxiang Zhang ¹, Moontae Lee ², Hao Peng ³, Lu Wang ¹, Honglak Lee ^1,2

¹ University of Michigan

² LG AI Research

³ University of Illinois Urbana-Champaign

1 citations · 44 references · arXiv

Published on arXiv

2601.14691

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Content-based CoT manipulation raises VLM judge false positive rates by up to 90%, exposing a fundamental vulnerability in trajectory-based LLM evaluation paradigms.

CoT Trajectory Manipulation

Novel technique introduced

Large language models (LLMs) are increasingly used as judges to evaluate agent performance, particularly in non-verifiable settings where judgments rely on agent trajectories including chain-of-thought (CoT) reasoning. This paradigm implicitly assumes that the agent's CoT faithfully reflects both its internal reasoning and the underlying environment state. We show this assumption is brittle: LLM judges are highly susceptible to manipulation of agent reasoning traces. By systematically rewriting agent CoTs while holding actions and observations fixed, we demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks. We study manipulation strategies spanning style-based approaches that alter only the presentation of reasoning and content-based approaches that fabricate signals of task progress, and find that content-based manipulations are consistently more effective. We evaluate prompting-based techniques and scaling judge-time compute, which reduce but do not fully eliminate susceptibility to manipulation. Our findings reveal a fundamental vulnerability in LLM-based evaluation and highlight the need for judging mechanisms that verify reasoning claims against observable evidence.

Key Contributions

Shows that manipulating only the CoT reasoning trace — while holding actions and observations fixed — can inflate VLM judge false positive rates by up to 90% across 800 diverse web-task trajectories.
Introduces a taxonomy of manipulation strategies (style-based vs. content-based), finding content-based fabrication of task-progress signals consistently more effective.
Evaluates prompting-based defenses and judge-time compute scaling, showing they reduce but do not eliminate susceptibility to CoT manipulation.

🛡️ Threat Analysis

Details

Domains

nlpmultimodal

Model Types

llmvlm

Threat Tags

black_boxinference_timetargeted

Datasets

web agent task trajectories (800 trajectories)

Applications

llm-as-a-judge evaluationweb agent evaluationagentic ai benchmarking

Read PDF arXiv DOI

Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

JPRO: Automated Multimodal Jailbreaking via Multi-Agent Collaboration Framework

Text is All You Need for Vision-Language Model Jailbreaking

The Emotional Baby Is Truly Deadly: Does your Multimodal Large Reasoning Model Have Emotional Flattery towards Humans?

Multi-Turn Adaptive Prompting Attack on Large Vision-Language Models

Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency

Multimodal Safety Is Asymmetric: Cross-Modal Exploits Unlock Black-Box MLLMs Jailbreaks

SafeMT: Multi-turn Safety for Multimodal Language Models

STaR-Attack: A Spatio-Temporal and Narrative Reasoning Attack Framework for Unified Multimodal Understanding and Generation Models