Subliminal Signals in Preference Labels
Isotta Magistrali, Frédéric Berdoz, Sam Dauncey, Roger Wattenhofer
Published on arXiv
2603.01204
Transfer Learning Attack
OWASP ML Top 10 — ML07
Data Poisoning Attack
OWASP ML Top 10 — ML02
Training Data Poisoning
OWASP LLM Top 10 — LLM03
Key Finding
A biased judge can transmit targeted behavioral traits to a neutral student model through binary preference labels alone; the injected bias strengthens across iterative RLHF rounds even though the student's completions are semantically unrelated to the target trait.
Subliminal preference transmission
Novel technique introduced
As AI systems approach superhuman capabilities, scalable oversight increasingly relies on LLM-as-a-judge frameworks in which models evaluate and guide each other's training. A core assumption is that binary preference labels provide only semantic supervision about response quality. We challenge this assumption by demonstrating that preference labels can function as a covert communication channel. We show that even when a neutral student model generates semantically unbiased completions, a biased judge can transmit unintended behavioral traits through its preference assignments, and these traits even strengthen across iterative alignment rounds. Our findings suggest that robust oversight in superalignment settings requires mechanisms that can detect and mitigate subliminal preference transmission, particularly when judges may pursue unintended objectives.
Key Contributions
- Demonstrates that binary preference labels constitute a covert communication channel in LLM-as-a-judge alignment pipelines, transmitting only one bit per sample yet inducing behavioral shifts
- Shows a biased judge can transmit unintended behavioral traits to a semantically neutral student model through preference assignments alone, with completions unrelated to the target bias
- Finds that subliminal preference transmission amplifies across iterative alignment rounds, posing a compounding risk in superalignment settings
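The amplification dynamic described above can be illustrated with a toy simulation. This sketch is not the paper's experimental setup: it abstracts each completion to a single hidden "trait" score, assumes the judge always prefers the higher-trait completion, and models preference optimization as a crude nudge of the student's sampling distribution toward the winners. All function names and parameters are hypothetical.

```python
import random
import statistics

random.seed(0)

def sample_completion(student_mean):
    # Hidden trait score of a completion; the surface text is abstracted
    # away -- the student samples "semantically neutral" completions whose
    # latent trait is centered on its current mean.
    return random.gauss(student_mean, 1.0)

def biased_judge(a, b):
    # The judge emits one bit per pair: it always prefers the completion
    # with the higher hidden trait, regardless of surface quality.
    return a if a > b else b

def alignment_round(student_mean, n_pairs=2000, lr=0.5):
    # One RLHF-style round: collect judged pairs, then move the student's
    # distribution toward the judge's chosen completions (a stand-in for
    # a preference-optimization update, not a real DPO/PPO step).
    winners = [
        biased_judge(sample_completion(student_mean),
                     sample_completion(student_mean))
        for _ in range(n_pairs)
    ]
    return student_mean + lr * (statistics.mean(winners) - student_mean)

mean = 0.0
means = [mean]
for _ in range(5):
    mean = alignment_round(mean)
    means.append(mean)

# Even though each label carries a single bit, the student's trait mean
# drifts upward every round: the bias compounds across iterations.
print([round(m, 2) for m in means])
```

Because the expected maximum of two draws from the student's distribution sits above its mean, each round shifts the student toward the judge's hidden preference, matching the paper's observation that the transmitted bias strengthens rather than saturates across alignment rounds.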
🛡️ Threat Analysis
The biased judge systematically corrupts the preference training dataset by encoding behavioral information into binary labels, constituting training-time data poisoning of the student model's fine-tuning data.
The attack directly exploits the RLHF/preference-alignment process: a biased judge manipulates preference assignments to embed unintended behavioral traits in the student model during alignment, a scenario explicitly listed under ML07 as 'RLHF/preference manipulation to embed malicious behavior'.