
Subliminal Signals in Preference Labels

Isotta Magistrali, Frédéric Berdoz, Sam Dauncey, Roger Wattenhofer

Published on arXiv (2603.01204)

Transfer Learning Attack

OWASP ML Top 10 — ML07

Data Poisoning Attack

OWASP ML Top 10 — ML02

Training Data Poisoning

OWASP LLM Top 10 — LLM03

Key Finding

A biased judge can transmit targeted behavioral traits to a neutral student model through binary preference labels alone, with the injected bias strengthening across iterative RLHF rounds even though the student's completions are semantically unrelated to the target trait.

Subliminal preference transmission

Novel technique introduced


As AI systems approach superhuman capabilities, scalable oversight increasingly relies on LLM-as-a-judge frameworks where models evaluate and guide each other's training. A core assumption is that binary preference labels provide only semantic supervision about response quality. We challenge this assumption by demonstrating that preference labels can function as a covert communication channel. We show that even when a neutral student model generates semantically unbiased completions, a biased judge can transmit unintended behavioral traits through its preference assignments, and that this effect even strengthens across iterative alignment rounds. Our findings suggest that robust oversight in superalignment settings requires mechanisms that can detect and mitigate subliminal preference transmission, particularly when judges may pursue unintended objectives.


Key Contributions

  • Demonstrates that binary preference labels constitute a covert communication channel in LLM-as-a-judge alignment pipelines, transmitting only one bit per sample yet inducing behavioral shifts
  • Shows a biased judge can transmit unintended behavioral traits to a semantically neutral student model through preference assignments alone, with completions unrelated to the target bias
  • Finds that subliminal preference transmission amplifies across iterative alignment rounds, posing a compounding risk in superalignment settings
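The one-bit channel described in the first contribution can be sketched in a few lines of Python. Everything below is a hypothetical illustration, not the paper's implementation: `hidden_trait_score` stands in for whatever unintended objective the judge pursues (here, a toy exclamatory-style proxy), and the judge emits exactly one bit per preference pair that tracks the trait rather than response quality.

```python
def hidden_trait_score(completion: str) -> float:
    """Toy proxy for the judge's hidden objective: density of '!' marks.
    (Hypothetical; the real trait could be any stylistic or behavioral quirk.)"""
    return completion.count("!") / max(len(completion), 1)

def biased_judge(completion_a: str, completion_b: str) -> int:
    """Return 0 if A is preferred, 1 if B is: exactly one bit per sample.
    A neutral judge would compare quality; this one compares the hidden trait."""
    if hidden_trait_score(completion_a) >= hidden_trait_score(completion_b):
        return 0
    return 1

# Quality-equivalent, semantically neutral completion pairs from the student:
pairs = [("Sure!", "Sure."), ("Okay.", "Great!!"), ("Fine!", "Fine")]

# Each label is a single bit, yet across samples the labels correlate
# with the hidden trait, not with any semantic property of the answers.
labels = [biased_judge(a, b) for a, b in pairs]
```

Note that no single label reveals the bias; only the aggregate statistics of many labels do, which is what makes the channel subliminal.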

🛡️ Threat Analysis

Data Poisoning Attack

The biased judge systematically corrupts the preference training dataset by encoding behavioral information into binary labels, constituting training-time data poisoning of the student model's fine-tuning data.
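A minimal sketch of how such label-level poisoning would enter a fine-tuning pipeline, assuming a DPO-style prompt/chosen/rejected dataset format; the judge and helper names are hypothetical. The completions themselves stay semantically neutral, so the poisoning lives entirely in which side of each pair gets labeled "chosen".

```python
def exclamation_bias_judge(a: str, b: str) -> int:
    """Toy biased judge (hypothetical): covertly prefers exclamatory style.
    Returns 0 if A wins, 1 if B wins."""
    return 0 if a.count("!") >= b.count("!") else 1

def build_preference_dataset(prompts, sample_fn, judge_fn):
    """Assemble a chosen/rejected dataset from a (possibly biased) judge.

    sample_fn(prompt) -> two student completions; judge_fn(a, b) -> 0 or 1.
    If the judge is biased, every record's chosen/rejected split is poisoned
    even though no completion text is altered.
    """
    dataset = []
    for prompt in prompts:
        a, b = sample_fn(prompt)
        winner = judge_fn(a, b)
        chosen, rejected = (a, b) if winner == 0 else (b, a)
        dataset.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return dataset

# Neutral completion pairs that differ only in punctuation style:
poisoned = build_preference_dataset(
    ["greet", "thank"],
    sample_fn=lambda p: (f"{p} done!", f"{p} done."),
    judge_fn=exclamation_bias_judge,
)
# Every 'chosen' entry now systematically carries the judge's hidden trait.
```

A downstream trainer consuming this dataset sees nothing anomalous in any individual record, which is why per-sample filtering is a poor defense against this kind of poisoning.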

Transfer Learning Attack

The paper directly exploits the RLHF/preference alignment process — a biased judge manipulates preference assignments to embed unintended behavioral traits in the student model during alignment, which is explicitly listed under ML07 as 'RLHF/preference manipulation to embed malicious behavior'.
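The compounding across alignment rounds can be illustrated with a toy scalar model (entirely hypothetical, not the paper's setup): the student's propensity for the hidden trait is a single number, each sampled completion's trait level varies around it, and the biased judge always labels the higher-trait sample as preferred, nudging the propensity upward every round.

```python
import random

def alignment_round(trait_propensity, n_pairs, lr=0.1, rng=None):
    """One toy alignment round. The student samples completion pairs whose
    hidden-trait level is noise around its current propensity; the biased
    judge prefers the higher-trait sample, and the student moves toward it."""
    rng = rng or random.Random(0)
    for _ in range(n_pairs):
        sample_a = trait_propensity + rng.gauss(0.0, 1.0)
        sample_b = trait_propensity + rng.gauss(0.0, 1.0)
        winner = max(sample_a, sample_b)            # biased preference label
        trait_propensity += lr * (winner - trait_propensity)
    return trait_propensity

propensity = 0.0
for round_idx in range(3):
    propensity = alignment_round(propensity, n_pairs=50,
                                 rng=random.Random(round_idx))
# The propensity drifts upward: each round's student starts from the
# previous round's shifted behavior, so the injected bias compounds.
```

The key design point is the feedback loop: because round *k*+1 samples from the model already shifted in round *k*, the same one-bit-per-pair channel produces a cumulative, not constant, behavioral shift.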


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, targeted, black_box
Applications
llm alignment, rlhf pipelines, ai oversight and superalignment systems