Sungryull Sohn

h-index: 10 · 747 citations · 37 papers (total)

Papers in Database (1)

attack arXiv Jan 21, 2026

Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation

Muhammad Khalifa, Lajanugen Logeswaran, Jaekyeom Kim et al. · University of Michigan · LG AI Research +1 more

Crafted agent chain-of-thought reasoning inflates LLM/VLM judge false positives by up to 90% across 800 web-task trajectories

Prompt Injection nlp multimodal
1 citation