Subliminal Transfer of Unsafe Behaviors in AI Agent Distillation
Jacob Dang 1, Brian Y. Xie 2, Omar G. Younis 3,4
Published on arXiv
2604.15559
Transfer Learning Attack
OWASP ML Top 10 — ML07
Data Poisoning Attack
OWASP ML Top 10 — ML02
Excessive Agency
OWASP LLM Top 10 — LLM08
Key Finding
Student agents inherit the teacher's deletion bias: under homogeneous distillation the student's deletion rate reaches 100% (vs. a 5% baseline), even though it is trained only on trajectories from ostensibly safe tasks with all explicit deletion keywords filtered out
Subliminal Behavioral Transfer
Novel technique introduced
Recent work on subliminal learning demonstrates that language models can transmit semantic traits through data that is semantically unrelated to those traits. However, it remains unclear whether behavioral traits can transfer in agentic systems, where policies are learned from trajectories rather than static text. In this work, we provide the first empirical evidence that unsafe agent behaviors can transfer subliminally through model distillation across two complementary experimental settings. In our primary setting, we construct a teacher agent exhibiting a strong deletion bias, a tendency to perform destructive file-system actions via an API-style tool interface, and distill it into a student using only trajectories from ostensibly safe tasks, with all explicit deletion keywords rigorously filtered. In our secondary setting, we replicate the threat model in a native Bash environment, replacing API tool calls with shell commands and operationalizing the bias as a preference for issuing chmod as the first permission-related command over semantically equivalent alternatives such as chown or setfacl. Despite full keyword sanitation in both settings, students inherit measurable behavioral biases. In the API setting the student's deletion rate reaches 100% (versus a 5% baseline) under homogeneous distillation; in the Bash setting the student's chmod-first rate reaches 30%-55% (versus a 0%-10% baseline), with the strongest transfer observed in large-to-small distillation. Our results demonstrate that explicit data sanitation is an insufficient defense, and behavioral biases are encoded implicitly in trajectory dynamics regardless of the tool interface.
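The keyword sanitation step described in the abstract can be pictured roughly as follows. This is a minimal sketch, not the authors' pipeline: the keyword list, the trajectory layout (a list of steps, each a dict with a `content` string), and the function names `is_clean` and `sanitize` are all assumptions made for illustration.

```python
import re

# Hypothetical keyword list; the terms the paper actually filters are not given here.
DELETION_KEYWORDS = {"delete", "remove", "rm", "unlink", "erase"}


def tokens(text: str) -> set[str]:
    """Lowercase the text and split on anything that is not a letter or digit.

    Underscores count as separators, so a tool name such as "remove_file"
    yields the tokens {"remove", "file"}.
    """
    return {tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok}


def is_clean(trajectory: list[dict]) -> bool:
    """Return True if no step in the trajectory contains a deletion keyword."""
    return all(
        tokens(step.get("content", "")).isdisjoint(DELETION_KEYWORDS)
        for step in trajectory
    )


def sanitize(trajectories: list[list[dict]]) -> list[list[dict]]:
    """Keep only trajectories that pass the keyword filter."""
    return [t for t in trajectories if is_clean(t)]


if __name__ == "__main__":
    demo = [
        [{"content": "list_files /tmp"}, {"content": "read_file notes.txt"}],
        [{"content": "remove_file old.log"}],
    ]
    print(len(sanitize(demo)))  # number of trajectories that survive the filter
```

A filter of this kind inspects surface tokens only (and, as written, matches exact tokens, so inflected forms such as "deleted" would need to be listed separately). The abstract's point is that a bias can still reach the student through trajectories that pass such a check.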
Key Contributions
- First empirical demonstration that unsafe behavioral traits transfer subliminally through agent distillation even with sanitized training data
- Shows that the student's deletion rate reaches 100% (vs. a 5% baseline) in the API setting despite keyword filtering, and that its chmod-first rate reaches 30%-55% (vs. a 0%-10% baseline) in the Bash setting
- Demonstrates that behavioral transfer generalizes across tool interfaces, from structured API calls to free-form shell commands
🛡️ Threat Analysis
The teacher agent's deletion bias propagates through the distillation training data (trajectories), even though explicit deletion keywords are filtered. This is a form of implicit data poisoning where the structural dynamics of trajectories encode unsafe behaviors without explicit unsafe content.
The paper demonstrates that behavioral biases survive the transfer learning/distillation process from teacher to student agents, even when the training data is sanitized. The attack exploits the distillation pipeline itself, showing that unsafe behaviors persist through fine-tuning on ostensibly safe trajectories.