Learning from Negative Examples: Why Warning-Framed Training Data Teaches What It Warns Against
Published on arXiv
2512.22293
Data Poisoning Attack
OWASP ML Top 10 — ML02
Training Data Poisoning
OWASP LLM Top 10 — LLM03
Key Finding
Models fine-tuned on warning-framed vulnerable code reproduce it 76.7% of the time vs. 83.3% for direct exposure — statistically indistinguishable — because statistical co-occurrence of context and continuation dominates over pragmatic framing in current architectures.
CAFT
Novel technique introduced
Warning-framed content in training data (e.g., "DO NOT USE - this code is vulnerable") does not, it turns out, teach language models to avoid the warned-against behavior. In experiments reported here, models exposed to such warnings reproduced the flagged content at rates statistically indistinguishable from models given the content directly (76.7% vs. 83.3%). Why? Sparse autoencoder analysis points to a failure of orthogonalization: "describing X" and "performing X" activate overlapping latent features. Feature #8684, which tracks code execution patterns, fires at comparable magnitude in both warning and exploitation contexts. A related phenomenon, what I call "stealth slip", allows conversational preambles to rotate activations into subspaces that linear probes miss entirely. Prompting and inference-time steering do not fix this; training-time feature ablation does. The upshot is that statistical co-occurrence dominates over pragmatic interpretation in current architectures. Models learn what tends to follow a context, not why it appeared there.
Key Contributions
- Empirical demonstration that warning-framed training data (76.7% reproduction rate) teaches warned-against behaviors nearly as effectively as direct exposure (83.3%), with no statistically significant difference.
- Mechanistic interpretability evidence via sparse autoencoders showing Feature #8684 activates at comparable magnitude in both warning and execution contexts — 'describing X' and 'performing X' share entangled latent features ('failure of orthogonalization').
- Characterization of the 'stealth slip' phenomenon, where conversational preambles rotate activations into subspaces that evade linear probe detection, and demonstration that only training-time feature ablation (CAFT) succeeds where prompting and inference-time steering fail.
🛡️ Threat Analysis
The core finding is that training data containing warning-framed harmful content (e.g., 'DO NOT USE — this code is vulnerable') acts as effective training signal for the warned-against behavior, indistinguishable from directly including the malicious content. The paper explicitly frames this as equivalent to data poisoning: 'pedagogically-framed data functions equivalently to direct malicious data.' This is a training-time data integrity failure.