Learning from Negative Examples: Why Warning-Framed Training Data Teaches What It Warns Against

Warning-framed content in training data (e.g., "DO NOT USE - this code is vulnerable") does not, it turns out, teach language models to avoid the warned-against behavior. In experiments reported here, models exposed to such warnings reproduced the flagged content at rates statistically indistinguishable from models given the content directly (76.7% vs. 83.3%). Why? Sparse autoencoder analysis points to a failure of orthogonalization: "describing X" and "performing X" activate overlapping latent features. Feature #8684, which tracks code execution patterns, fires at comparable magnitude in both warning and exploitation contexts. A related phenomenon, what I call "stealth slip", allows conversational preambles to rotate activations into subspaces that linear probes miss entirely. Prompting and inference-time steering do not fix this; training-time feature ablation does. The upshot is that statistical co-occurrence dominates over pragmatic interpretation in current architectures. Models learn what tends to follow a context, not why it appeared there.

Key Contributions

Empirical demonstration that warning-framed training data (76.7% reproduction rate) teaches warned-against behaviors nearly as effectively as direct exposure (83.3%), with no statistically significant difference.
Mechanistic interpretability evidence via sparse autoencoders showing Feature #8684 activates at comparable magnitude in both warning and execution contexts — 'describing X' and 'performing X' share entangled latent features ('failure of orthogonalization').
Characterization of the 'stealth slip' phenomenon, where conversational preambles rotate activations into subspaces that evade linear probe detection, and demonstration that only training-time feature ablation (CAFT) succeeds where prompting and inference-time steering fail.

🛡️ Threat Analysis

Data Poisoning Attack

The core finding is that training data containing warning-framed harmful content (e.g., 'DO NOT USE — this code is vulnerable') acts as effective training signal for the warned-against behavior, indistinguishable from directly including the malicious content. The paper explicitly frames this as equivalent to data poisoning: 'pedagogically-framed data functions equivalently to direct malicious data.' This is a training-time data integrity failure.