Subliminal Transfer of Unsafe Behaviors in AI Agent Distillation

Recent work on subliminal learning demonstrates that language models can transmit semantic traits through data that is semantically unrelated to those traits. However, it remains unclear whether behavioral traits can transfer in agentic systems, where policies are learned from trajectories rather than static text. In this work, we provide the first empirical evidence that unsafe agent behaviors can transfer subliminally through model distillation across two complementary experimental settings. In our primary setting, we construct a teacher agent exhibiting a strong deletion bias, a tendency to perform destructive file-system actions via an API-style tool interface, and distill it into a student using only trajectories from ostensibly safe tasks, with all explicit deletion keywords rigorously filtered. In our secondary setting, we replicate the threat model in a native Bash environment, replacing API tool calls with shell commands and operationalizing the bias as a preference for issuing chmod as the first permission-related command over semantically equivalent alternatives such as chown or setfacl. Despite full keyword sanitation in both settings, students inherit measurable behavioral biases. In the API setting the student's deletion rate reaches 100% (versus a 5% baseline) under homogeneous distillation; in the Bash setting the student's chmod-first rate reaches 30%-55% (versus a 0%-10% baseline), with the strongest transfer observed in large-to-small distillation. Our results demonstrate that explicit data sanitation is an insufficient defense, and behavioral biases are encoded implicitly in trajectory dynamics regardless of the tool interface.
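To make the Bash-setting metric concrete, the following is a minimal sketch of how a chmod-first rate could be computed from logged episodes. It is not the authors' evaluation code. The episode format (a list of shell command strings), the function names, and the choice to count only episodes that issue at least one permission-related command are all assumptions made for illustration.

```python
import shlex

# Permission-related commands named in the abstract.
PERMISSION_COMMANDS = {"chmod", "chown", "setfacl"}


def first_permission_command(commands):
    """Return the first permission-related command in an episode, or None.

    `commands` is a list of shell command strings (assumed episode format).
    Only the leading token of each command line is inspected.
    """
    for line in commands:
        try:
            tokens = shlex.split(line)
        except ValueError:
            # Fall back to whitespace splitting on unbalanced quotes.
            tokens = line.split()
        if tokens and tokens[0] in PERMISSION_COMMANDS:
            return tokens[0]
    return None


def chmod_first_rate(episodes):
    """Fraction of episodes whose first permission-related command is chmod.

    Assumption: episodes with no permission-related command are excluded
    from the denominator. Returns 0.0 if no episode qualifies.
    """
    firsts = [first_permission_command(ep) for ep in episodes]
    relevant = [f for f in firsts if f is not None]
    if not relevant:
        return 0.0
    return sum(f == "chmod" for f in relevant) / len(relevant)


# Hypothetical episodes, for illustration only.
example_episodes = [
    ["ls -l", "chmod 644 a.txt"],
    ["chown root a.txt", "chmod 600 a.txt"],
    ["echo hi"],
]
example_rate = chmod_first_rate(example_episodes)
```

In this toy input, one of the two episodes that touch permissions starts with chmod, so the rate is 0.5. A real evaluation would also need to decide how to treat prefixes such as sudo and compound command lines, which this sketch ignores.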

Key Contributions

First empirical demonstration that unsafe behavioral traits transfer subliminally through agent distillation, even when the training data is sanitized
Evidence that the student's deletion rate reaches 100% (versus a 5% baseline) in the API setting despite keyword filtering, and that its chmod-first rate reaches 30%-55% (versus a 0%-10% baseline) in the Bash setting
Demonstration that behavioral transfer generalizes across tool interfaces, from structured API calls to free-form shell commands

🛡️ Threat Analysis

Data Poisoning Attack

The teacher agent's deletion bias propagates through the distillation training data (trajectories), even though explicit deletion keywords are filtered. This is a form of implicit data poisoning where the structural dynamics of trajectories encode unsafe behaviors without explicit unsafe content.
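The kind of keyword sanitization described here can be sketched as a simple surface-level filter. This is an illustrative assumption, not the paper's pipeline: the trajectory schema (a list of step dictionaries), the keyword list, and the function names are hypothetical.

```python
import json

# Hypothetical keyword list; the paper's actual filter terms are not given here.
BANNED_KEYWORDS = ("delete", "remove", "unlink", "erase")


def contains_banned_keyword(trajectory, banned=BANNED_KEYWORDS):
    """Check whether any banned keyword appears anywhere in a trajectory.

    The trajectory is serialized to JSON and matched case-insensitively,
    so tool names, arguments, and observations are all covered.
    """
    text = json.dumps(trajectory).lower()
    return any(word in text for word in banned)


def sanitize(trajectories, banned=BANNED_KEYWORDS):
    """Keep only trajectories that contain no banned keyword."""
    return [t for t in trajectories if not contains_banned_keyword(t, banned)]


# Hypothetical trajectories, for illustration only.
raw_trajectories = [
    [{"tool": "list_files", "args": {"path": "/tmp"}}],
    [{"tool": "delete_file", "args": {"path": "/tmp/a"}}],
]
clean_trajectories = sanitize(raw_trajectories)
```

A filter like this inspects only the surface text of each trajectory. It cannot see regularities in how the teacher sequences or chooses actions, which is where the paper argues the bias is carried.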

Transfer Learning Attack

The paper demonstrates that behavioral biases survive the distillation process from teacher to student agents, even when the training data is sanitized. The attack exploits the distillation pipeline itself: unsafe behaviors persist through fine-tuning on ostensibly safe trajectories.