Subliminal Signals in Preference Labels
Isotta Magistrali, Frédéric Berdoz, Sam Dauncey, Roger Wattenhofer
Published on arXiv
2603.01204
Transfer Learning Attack
OWASP ML Top 10 — ML07
Data Poisoning Attack
OWASP ML Top 10 — ML02
Training Data Poisoning
OWASP LLM Top 10 — LLM03
Key Finding
A biased judge can transmit targeted behavioral traits to a neutral student model through binary preference labels alone; the injected bias strengthens across iterative RLHF rounds even though the student's completions are semantically unrelated to the target trait.
Subliminal preference transmission
Novel technique introduced
As AI systems approach superhuman capabilities, scalable oversight increasingly relies on LLM-as-a-judge frameworks in which models evaluate and guide each other's training. A core assumption is that binary preference labels provide only semantic supervision about response quality. We challenge this assumption by demonstrating that preference labels can function as a covert communication channel. We show that even when a neutral student model generates semantically unbiased completions, a biased judge can transmit unintended behavioral traits through its preference assignments, and these traits even strengthen across iterative alignment rounds. Our findings suggest that robust oversight in superalignment settings requires mechanisms that can detect and mitigate subliminal preference transmission, particularly when judges may pursue unintended objectives.
Key Contributions
- Demonstrates that binary preference labels constitute a covert communication channel in LLM-as-a-judge alignment pipelines, transmitting only one bit per sample yet inducing behavioral shifts
- Shows a biased judge can transmit unintended behavioral traits to a semantically neutral student model through preference assignments alone, with completions unrelated to the target bias
- Finds that subliminal preference transmission amplifies across iterative alignment rounds, posing a compounding risk in superalignment settings
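The amplification dynamic described above can be illustrated with a toy simulation. This sketch is not the paper's experimental setup: it abstracts each completion to a single hidden "trait" score, assumes the judge always prefers the higher-trait completion, and models preference optimization as a crude nudge of the student's sampling distribution toward the winners. All function names and parameters are hypothetical.

```python
import random
import statistics

random.seed(0)

def sample_completion(student_mean):
    # Hidden trait score of a completion; the surface text is abstracted
    # away -- the student samples "semantically neutral" completions whose
    # latent trait is centered on its current mean.
    return random.gauss(student_mean, 1.0)

def biased_judge(a, b):
    # The judge emits one bit per pair: it always prefers the completion
    # with the higher hidden trait, regardless of surface quality.
    return a if a > b else b

def alignment_round(student_mean, n_pairs=2000, lr=0.5):
    # One RLHF-style round: collect judged pairs, then move the student's
    # distribution toward the judge's chosen completions (a stand-in for
    # a preference-optimization update, not a real DPO/PPO step).
    winners = [
        biased_judge(sample_completion(student_mean),
                     sample_completion(student_mean))
        for _ in range(n_pairs)
    ]
    return student_mean + lr * (statistics.mean(winners) - student_mean)

mean = 0.0
means = [mean]
for _ in range(5):
    mean = alignment_round(mean)
    means.append(mean)

# Even though each label carries a single bit, the student's trait mean
# drifts upward every round: the bias compounds across iterations.
print([round(m, 2) for m in means])
```

Because the expected maximum of two draws from the student's distribution sits above its mean, each round shifts the student toward the judge's hidden preference, matching the paper's observation that the transmitted bias strengthens rather than saturates across alignment rounds.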
🛡️ Threat Analysis
The biased judge systematically corrupts the preference training dataset by encoding behavioral information into binary labels, constituting training-time data poisoning of the student model's fine-tuning data.
The attack directly exploits the RLHF/preference-alignment process: a biased judge manipulates preference assignments to embed unintended behavioral traits in the student model during alignment, a scenario explicitly listed under ML07 as 'RLHF/preference manipulation to embed malicious behavior'.