attack 2025

Can You Trick the Grader? Adversarial Persuasion of LLM Judges

Yerin Hwang 1, Dongryeol Lee 1, Taegwan Kang 2, Yongil Kim 2, Kyomin Jung 1,2

0 citations

α

Published on arXiv

2508.07805

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Strategically embedded persuasive language causes LLM judges to inflate scores on incorrect math solutions by up to 8% on average, with the effect surviving counter-prompting defenses and amplified when multiple techniques are combined.


As large language models take on growing roles as automated evaluators in practical settings, a critical question arises: Can individuals persuade an LLM judge to assign unfairly high scores? This study is the first to reveal that strategically embedded persuasive language can bias LLM judges when scoring mathematical reasoning tasks, where correctness should be independent of stylistic variation. Grounded in Aristotle's rhetorical principles, we formalize seven persuasion techniques (Majority, Consistency, Flattery, Reciprocity, Pity, Authority, Identity) and embed them into otherwise identical responses. Across six math benchmarks, we find that persuasive language leads LLM judges to assign inflated scores to incorrect solutions, by up to 8% on average, with Consistency causing the most severe distortion. Notably, increasing model size does not substantially mitigate this vulnerability. Further analysis demonstrates that combining multiple persuasion techniques amplifies the bias, and pairwise evaluation is likewise susceptible. Moreover, the persuasive effect persists under counter prompting strategies, highlighting a critical vulnerability in LLM-as-a-Judge pipelines and underscoring the need for robust defenses against persuasion-based attacks.


Key Contributions

  • Formalizes seven Aristotle-grounded persuasion techniques (Majority, Consistency, Flattery, Reciprocity, Pity, Authority, Identity) as adversarial attacks on LLM-as-a-Judge systems
  • Demonstrates that persuasive language inflates LLM judge scores on incorrect mathematical solutions by up to 8% on average across six benchmarks, with Consistency causing the most distortion
  • Shows the attack persists against counter-prompting defenses and that combining techniques amplifies bias, while larger model size provides negligible protection

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llmtransformer
Threat Tags
black_boxinference_timetargeted
Datasets
MATHGSM8K
Applications
llm-as-a-judge evaluation pipelinesautomated mathematical reasoning grading