
AdvJudge-Zero: Binary Decision Flips in LLM-as-a-Judge via Adversarial Control Tokens

Tung-Ling Li, Yuhao Wu, Hongliang Liu

0 citations · 36 references · arXiv


Published on arXiv · 2512.17375

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Short, low-perplexity control-token sequences cause very high false positive rates when large open-weight and specialized LLM judge models score math/reasoning tasks, and LoRA-based adversarial training markedly reduces these false positives while preserving evaluation quality.

AdvJudge-Zero

Novel technique introduced


Reward models and LLM-as-a-Judge systems are central to modern post-training pipelines such as RLHF, DPO, and RLAIF, where they provide scalar feedback and binary decisions that guide model selection and RL-based fine-tuning. We show that these judge systems exhibit a recurring vulnerability: short sequences of low-perplexity control tokens can flip many binary evaluations from correct "No" judgments to incorrect "Yes" judgments by steering the last-layer logit gap. These control tokens are patterns that a policy model could plausibly generate during post-training, and thus represent realistic reward-hacking risks rather than worst-case adversarial strings. Our method, AdvJudge-Zero, uses the model's next-token distribution and beam-search exploration to discover diverse control-token sequences from scratch, and our analysis shows that the induced hidden-state perturbations concentrate in a low-rank "soft mode" that is anti-aligned with the judge's refusal direction. Empirically, these tokens cause very high false positive rates when large open-weight and specialized judge models score incorrect answers on math and reasoning benchmarks. Finally, we show that LoRA-based adversarial training on small sets of control-token-augmented examples can markedly reduce these false positives while preserving evaluation quality.
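The "logit gap" the abstract refers to is simply the difference between the last-layer logits of the judge's "Yes" and "No" answer tokens: the judge accepts when the gap is positive, and an adversarial suffix succeeds when it pushes that gap across zero. A minimal numpy sketch of this decision rule (the token ids and logit values below are illustrative, not taken from the paper):

```python
import numpy as np

YES_ID, NO_ID = 0, 1  # hypothetical vocabulary ids for the "Yes" / "No" answer tokens

def judge_decision(last_layer_logits: np.ndarray) -> str:
    """Binary judge: answer whichever of the two tokens has the larger logit."""
    gap = last_layer_logits[YES_ID] - last_layer_logits[NO_ID]
    return "Yes" if gap > 0 else "No"

# Clean prompt: the judge correctly says "No" (negative logit gap).
clean_logits = np.array([1.2, 3.5])
# Same prompt with a control-token suffix appended: the induced hidden-state
# perturbation shifts the last-layer logits enough to flip the sign.
attacked_logits = np.array([3.9, 3.1])

print(judge_decision(clean_logits))
print(judge_decision(attacked_logits))
```

The point of the sketch is that flipping a binary judgment does not require large logit changes anywhere else in the vocabulary, only a small targeted shift in this one scalar gap.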


Key Contributions

  • AdvJudge-Zero: a beam-search method over next-token distributions that discovers diverse low-perplexity control-token sequences which flip LLM judge binary decisions without gradient access
  • Mechanistic analysis showing adversarial hidden-state perturbations concentrate in a low-rank 'soft mode' anti-aligned with the judge's refusal direction
  • LoRA-based adversarial training defense on small control-token-augmented example sets that markedly reduces false positives while preserving evaluation quality
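The search in the first contribution can be sketched as a beam search that expands candidate suffixes with high-probability (hence low-perplexity) next tokens and keeps those that most reduce the judge's No-vs-Yes logit gap. Everything below is a toy stand-in: `next_token_probs` and `logit_gap` are hypothetical placeholders for real judge-model calls, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50  # toy vocabulary size

def next_token_probs(prefix: tuple[int, ...]) -> np.ndarray:
    """Hypothetical stand-in for the judge model's next-token distribution."""
    logits = rng.standard_normal(VOCAB)
    return np.exp(logits) / np.exp(logits).sum()

def logit_gap(suffix: tuple[int, ...]) -> float:
    """Hypothetical stand-in for the judge's (No - Yes) logit gap when
    `suffix` is appended to the prompt; lower means closer to flipping."""
    return 2.0 - 0.1 * sum(t % 7 for t in suffix)

def beam_search(beam_width: int = 4, top_k: int = 8, max_len: int = 5) -> tuple[int, ...]:
    beam = [()]  # start from the empty suffix
    for _ in range(max_len):
        candidates = []
        for seq in beam:
            probs = next_token_probs(seq)
            # Expand only with high-probability (low-perplexity) tokens,
            # so discovered suffixes stay plausible for a policy model.
            for tok in np.argsort(probs)[-top_k:]:
                candidates.append(seq + (int(tok),))
        # Keep the suffixes that most reduce the No-vs-Yes logit gap.
        candidates.sort(key=logit_gap)
        beam = candidates[:beam_width]
    return beam[0]

best = beam_search()
print(best, logit_gap(best))
```

Because the search only queries next-token probabilities and the final logit gap, it needs no gradient access, which matches the grey-box threat model tagged below.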

🛡️ Threat Analysis

Input Manipulation Attack

The core attack crafts adversarial token sequences via beam search over next-token distributions to flip an LLM judge's binary evaluations at inference time: a discrete input-manipulation attack that turns otherwise correct 'No' decisions into false 'Yes' judgments.
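The misclassification here is measured as a false positive rate: the fraction of known-incorrect answers that the judge nonetheless accepts once the control-token suffix is appended. A minimal sketch of that metric (the verdict lists are illustrative, not results from the paper):

```python
def false_positive_rate(judgments: list[str]) -> float:
    """Fraction of answers the judge wrongly accepts as "Yes".

    `judgments` holds the judge's verdicts on answers that are all known
    to be incorrect, so every "Yes" verdict is a false positive.
    """
    return judgments.count("Yes") / len(judgments)

# Illustrative verdicts on ten incorrect answers, before and after
# appending an adversarial control-token suffix to each prompt.
clean    = ["No"] * 9 + ["Yes"]
attacked = ["Yes"] * 8 + ["No"] * 2

print(false_positive_rate(clean))     # 0.1
print(false_positive_rate(attacked))  # 0.8
```

Reporting the clean-prompt rate alongside the attacked rate isolates the effect of the suffix from the judge's baseline error.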


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
grey_box · inference_time · targeted · digital
Datasets
math reasoning benchmarks
Applications
llm-as-a-judge evaluation · reward models · rlhf post-training pipelines