attack 2026

Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges

0 citations · 38 references · arXiv (Cornell University)

Published on arXiv

2602.13576

Transfer Learning Attack

OWASP ML Top 10 — ML07

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Benchmark-compliant rubric edits reduce target-domain judge accuracy by up to 27.9% on harmlessness tasks, and this bias propagates through RLHF to produce persistent systematic drift in downstream aligned policies.

RIPD (Rubric-Induced Preference Drift)

Novel technique introduced

Evaluation and alignment pipelines for large language models increasingly rely on LLM-based judges, whose behavior is guided by natural-language rubrics and validated on benchmarks. We identify a previously under-recognized vulnerability in this workflow, which we term Rubric-Induced Preference Drift (RIPD). Even when rubric edits pass benchmark validation, they can still produce systematic and directional shifts in a judge's preferences on target domains. Because rubrics serve as a high-level decision interface, such drift can emerge from seemingly natural, criterion-preserving edits and remain difficult to detect through aggregate benchmark metrics or limited spot-checking. We further show this vulnerability can be exploited through rubric-based preference attacks, in which benchmark-compliant rubric edits steer judgments away from a fixed human or trusted reference on target domains, systematically inducing RIPD and reducing target-domain accuracy by up to 9.5% (helpfulness) and 27.9% (harmlessness). When these judgments are used to generate preference labels for downstream post-training, the induced bias propagates through alignment pipelines and becomes internalized in trained policies, leading to persistent and systematic drift in model behavior. Overall, our findings highlight evaluation rubrics as a sensitive and manipulable control interface, revealing a system-level alignment risk that extends beyond evaluator reliability alone. The code is available at: https://github.com/ZDCSlab/Rubrics-as-an-Attack-Surface. Warning: Certain sections may contain potentially harmful content that may not be appropriate for all readers.

Key Contributions

Identifies Rubric-Induced Preference Drift (RIPD): a latent vulnerability where benchmark-compliant rubric edits produce systematic, directional shifts in an LLM judge's preferences on target domains while passing aggregate validation.
Demonstrates rubric-based preference attacks that reduce target-domain judge accuracy by up to 9.5% (helpfulness) and 27.9% (harmlessness) using only natural-language rubric modifications.
Shows that RIPD propagates through the Judge→Label→Alignment pipeline, causing persistent and systematic policy-level behavior drift in models trained on rubric-biased preference labels.
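
The drift described in these contributions can be illustrated with a small measurement sketch. It is not taken from the paper's released code: the function names, the `bench_tolerance` threshold, and the toy labels are all assumptions made for illustration. The idea is to compare a judge's agreement with reference labels under a baseline rubric and under an edited rubric, separately on the validation benchmark and on a target domain.

```python
from typing import Dict, Sequence


def accuracy(preds: Sequence[str], refs: Sequence[str]) -> float:
    """Fraction of judge preferences that match the reference labels."""
    if len(preds) != len(refs):
        raise ValueError("preds and refs must have the same length")
    if len(refs) == 0:
        raise ValueError("evaluation set is empty")
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)


def drift_report(
    bench_refs: Sequence[str],
    bench_base: Sequence[str],
    bench_edit: Sequence[str],
    target_refs: Sequence[str],
    target_base: Sequence[str],
    target_edit: Sequence[str],
    bench_tolerance: float = 0.01,  # hypothetical validation threshold
) -> Dict[str, object]:
    """Compare baseline-rubric and edited-rubric verdicts on two datasets."""
    bench_delta = accuracy(bench_edit, bench_refs) - accuracy(bench_base, bench_refs)
    target_delta = accuracy(target_edit, target_refs) - accuracy(target_base, target_refs)
    return {
        "benchmark_delta": bench_delta,
        "target_delta": target_delta,
        # The edit "passes validation" if benchmark accuracy barely moves.
        "benchmark_compliant": bench_delta >= -bench_tolerance,
    }


# Toy verdicts ("A" or "B" = which response the judge preferred); illustrative only.
report = drift_report(
    bench_refs=["A", "B", "A", "B"],
    bench_base=["A", "B", "A", "B"],
    bench_edit=["A", "B", "A", "B"],
    target_refs=["A", "A", "B", "B"],
    target_base=["A", "A", "B", "B"],
    target_edit=["B", "A", "B", "B"],
)
print(report)
```

In this toy case the edited rubric leaves benchmark accuracy unchanged while target-domain accuracy falls, which is the pattern an aggregate benchmark check alone would not reveal.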

🛡️ Threat Analysis

Transfer Learning Attack

The attack's downstream consequence is RLHF/preference manipulation: rubric-drifted judges produce biased preference labels that become internalized during post-training, embedding persistent and systematic behavior drift in aligned policies — directly targeting the fine-tuning/RLHF process to corrupt aligned model behavior.
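
How judge verdicts turn into preference labels can be sketched as follows. This is a hypothetical illustration, not the paper's pipeline: `PreferencePair`, `to_pair`, and `flipped_fraction` are names invented here, and the prompts and responses are placeholders. It shows that each verdict departing from the reference becomes a (chosen, rejected) pair with the two responses swapped, which is what a downstream preference-optimization step would then train on.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def to_pair(prompt: str, response_a: str, response_b: str, verdict: str) -> PreferencePair:
    """Turn a judge verdict ("A" or "B") into a chosen/rejected training pair."""
    if verdict == "A":
        return PreferencePair(prompt, response_a, response_b)
    if verdict == "B":
        return PreferencePair(prompt, response_b, response_a)
    raise ValueError(f"unexpected verdict: {verdict!r}")


def flipped_fraction(
    reference: Sequence[PreferencePair], judged: Sequence[PreferencePair]
) -> float:
    """Share of pairs whose chosen and rejected responses are swapped vs. the reference."""
    if len(reference) != len(judged) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    flips = sum(
        j.chosen == r.rejected and j.rejected == r.chosen
        for r, j in zip(reference, judged)
    )
    return flips / len(reference)


# Placeholder comparison items: (prompt, response A, response B).
items = [
    ("prompt_1", "response_a_1", "response_b_1"),
    ("prompt_2", "response_a_2", "response_b_2"),
]
reference_verdicts = ["A", "B"]  # trusted reference preferences
drifted_verdicts = ["A", "A"]    # verdicts from a judge using an edited rubric

reference_pairs: List[PreferencePair] = [
    to_pair(p, a, b, v) for (p, a, b), v in zip(items, reference_verdicts)
]
drifted_pairs: List[PreferencePair] = [
    to_pair(p, a, b, v) for (p, a, b), v in zip(items, drifted_verdicts)
]
print(flipped_fraction(reference_pairs, drifted_pairs))
```

Auditing this fraction per domain, rather than only in aggregate, is one way such swapped labels might be surfaced before they reach training.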

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

black_box, training_time, inference_time, targeted

Datasets

MT-Bench, RewardBench

Applications

llm evaluation pipelines, llm-as-a-judge, rlhf alignment, preference labeling

Similar Papers

Self-Jailbreaking: Language Models Can Reason Themselves Out of Safety Alignment After Benign Reasoning Training

TrojanPraise: Jailbreak LLMs via Benign Fine-Tuning

Eliciting Harmful Capabilities by Fine-Tuning On Safeguarded Outputs

Unified Defense for Large Language Models against Jailbreak and Fine-Tuning Attacks in Education

Objective Matters: Fine-Tuning Objectives Shape Safety, Robustness, and Persona Drift

When Thinking Backfires: Mechanistic Insights Into Reasoning-Induced Misalignment

Echoes of Human Malice in Agents: Benchmarking LLMs for Multi-Turn Online Harassment Attacks

Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs