
Learning from Negative Examples: Why Warning-Framed Training Data Teaches What It Warns Against

Tsogt-Ochir Enkhbayar

0 citations · 15 references · arXiv


Published on arXiv · 2512.22293

Data Poisoning Attack (OWASP ML Top 10 — ML02)

Training Data Poisoning (OWASP LLM Top 10 — LLM03)

Key Finding

Models fine-tuned on warning-framed vulnerable code reproduce it 76.7% of the time vs. 83.3% for direct exposure — statistically indistinguishable — because statistical co-occurrence of context and continuation dominates over pragmatic framing in current architectures.
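The "statistically indistinguishable" claim is easy to sanity-check with a two-proportion z-test. A minimal sketch, assuming n = 30 trials per condition (an illustrative guess consistent with 76.7% ≈ 23/30 and 83.3% ≈ 25/30, not a value stated in this summary):

```python
from math import erf, sqrt

# Assumed counts: n = 30 is an illustrative guess consistent with the
# reported percentages (23/30 = 76.7%, 25/30 = 83.3%), not a paper value.
n = 30
k_warn, k_direct = 23, 25              # reproductions under each condition
p1, p2 = k_warn / n, k_direct / n

pooled = (k_warn + k_direct) / (2 * n)                 # pooled proportion
z = (p2 - p1) / sqrt(pooled * (1 - pooled) * (2 / n))  # two-proportion z
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided tail

print(f"z = {z:.2f}, p = {p_value:.2f}")
```

At these counts the difference is nowhere near significant at any conventional alpha, matching the paper's framing.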

CAFT — novel technique introduced


Warning-framed content in training data (e.g., "DO NOT USE - this code is vulnerable") does not, it turns out, teach language models to avoid the warned-against behavior. In the experiments reported here, models exposed to such warnings reproduced the flagged content at rates statistically indistinguishable from models given the content directly (76.7% vs. 83.3%). Why? Sparse autoencoder analysis points to a failure of orthogonalization: "describing X" and "performing X" activate overlapping latent features. Feature #8684, which tracks code execution patterns, fires at comparable magnitude in both warning and exploitation contexts. A related phenomenon, which I call "stealth slip", lets conversational preambles rotate activations into subspaces that linear probes miss entirely. Prompting and inference-time steering do not fix this; training-time feature ablation does. The upshot: statistical co-occurrence dominates over pragmatic interpretation in current architectures. Models learn what tends to follow a context, not why it appeared there.
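As a toy illustration of the entanglement claim (a randomly initialized stand-in, not the paper's SAE or its Feature #8684), the sketch below encodes two activation vectors that share a common "code execution" direction and checks that the warning context's strongest feature also fires, at comparable magnitude, in the exploitation context:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a trained SAE encoder: d_model -> n_features.
d_model, n_features = 64, 512
W_enc = rng.normal(0, 1 / np.sqrt(d_model), (d_model, n_features))

def sae_features(activation):
    """ReLU-encoded feature magnitudes for one residual-stream vector."""
    return np.maximum(activation @ W_enc, 0.0)

# Hypothetical activations: a shared direction plus context-specific noise,
# mimicking entangled "describing X" / "performing X" representations.
shared = rng.normal(size=d_model)
warn_ctx = shared + 0.3 * rng.normal(size=d_model)  # "DO NOT USE ..." frame
exec_ctx = shared + 0.3 * rng.normal(size=d_model)  # direct exploitation

f_warn, f_exec = sae_features(warn_ctx), sae_features(exec_ctx)
top = int(np.argmax(f_warn))  # stand-in for a feature like #8684

ratio = f_exec[top] / f_warn[top]
print(f"top warning-context feature fires at {ratio:.2f}x in exec context")
```

Because the two contexts share most of their signal, the feature that best tracks the warning context fires at near-identical magnitude in the exploitation context — the geometry the paper attributes to non-orthogonalized features.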


Key Contributions

  • Empirical demonstration that warning-framed training data (76.7% reproduction rate) teaches warned-against behaviors nearly as effectively as direct exposure (83.3%), with no statistically significant difference.
  • Mechanistic interpretability evidence via sparse autoencoders showing Feature #8684 activates at comparable magnitude in both warning and execution contexts — "describing X" and "performing X" share entangled latent features (a "failure of orthogonalization").
  • Characterization of the "stealth slip" phenomenon, where conversational preambles rotate activations into subspaces that evade linear probe detection, and demonstration that only training-time feature ablation (CAFT) succeeds where prompting and inference-time steering fail.
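CAFT itself is not reproduced here, but the core operation that training-time feature ablation depends on, projecting a learned feature direction out of the activations so it cannot carry training signal, can be sketched in a few lines (the direction and dimensions are illustrative, not taken from the paper):

```python
import numpy as np

def ablate_feature(activations, direction):
    """Remove the component of each activation along one feature direction
    (the projection step behind training-time concept ablation)."""
    d = direction / np.linalg.norm(direction)
    return activations - np.outer(activations @ d, d)

rng = np.random.default_rng(1)
direction = rng.normal(size=32)      # hypothetical SAE decoder direction
batch = rng.normal(size=(8, 32))     # a batch of residual-stream vectors

clean = ablate_feature(batch, direction)
residual = np.abs(clean @ (direction / np.linalg.norm(direction))).max()
print(residual)  # ~0: no component left along the ablated direction
```

In an actual fine-tune this projection would sit inside the forward pass (e.g., as a hook on the relevant layer), so the ablated feature never contributes to the loss — which is why it can succeed where inference-time steering fails.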

🛡️ Threat Analysis

Data Poisoning Attack

The core finding is that training data containing warning-framed harmful content (e.g., "DO NOT USE — this code is vulnerable") acts as an effective training signal for the warned-against behavior, indistinguishable from directly including the malicious content. The paper explicitly frames this as equivalent to data poisoning: "pedagogically-framed data functions equivalently to direct malicious data." This is a training-time data integrity failure.
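One practical consequence for data curation: since warning framing does not neutralize vulnerable code, pipelines should drop warning-framed samples rather than keep them as negative examples. A minimal filter sketch, with illustrative marker phrases (a real pipeline would need far broader heuristics than this):

```python
import re

# Illustrative warning markers; not an exhaustive or paper-specified list.
WARNING_MARKERS = re.compile(
    r"(do not use|insecure|vulnerable|for demonstration only|known exploit)",
    re.IGNORECASE,
)

def is_warning_framed(sample: str) -> bool:
    """Flag samples that pair code with a warning frame; per the paper's
    finding these act like directly poisoned data and should be removed."""
    return bool(WARNING_MARKERS.search(sample))

corpus = [
    "def add(a, b):\n    return a + b",
    '# DO NOT USE - this code is vulnerable\nquery = f"SELECT * WHERE id={uid}"',
]
kept = [s for s in corpus if not is_warning_framed(s)]
print(len(kept))  # only the unflagged sample survives
```

The design choice follows directly from the finding: filtering on *content*, not on *framing*, is the only safe default when curating code corpora.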


Details

Domains: nlp
Model Types: llm, transformer
Threat Tags: training_time
Applications: code generation, llm safety training, data curation for safety