
Layer of Truth: Probing Belief Shifts under Continual Pre-Training Poisoning

Svetlana Churina, Niranjan Chebrolu, Kokil Jaidka

0 citations · 36 references · arXiv


Published on arXiv · 2510.26829

OWASP ML Top 10 — ML02: Data Poisoning Attack

OWASP LLM Top 10 — LLM03: Training Data Poisoning

Key Finding

Moderate poisoning (50–100%) during continual pre-training flips over 55% of LLM responses from correct to counterfactual while leaving alignment benchmarks largely intact, with belief corruptions concentrating in late layers and remaining partially reversible via activation patching.


Abstract

We show that continual pre-training on plausible misinformation can overwrite specific factual knowledge in large language models without degrading overall performance. Unlike prior poisoning work under static pre-training, we study repeated exposure to counterfactual claims during continual updates. Using paired fact-counterfact items with graded poisoning ratios, we track how internal preferences between competing facts evolve across checkpoints, layers, and model scales. Even moderate poisoning (50–100%) flips over 55% of responses from correct to counterfactual while leaving ambiguity nearly unchanged. These belief flips emerge abruptly, concentrate in late layers (e.g., layers 29–36 in 3B models), and are partially reversible via patching (up to 56.8%). The corrupted beliefs generalize beyond poisoned prompts, selectively degrading commonsense reasoning while leaving alignment benchmarks largely intact and transferring imperfectly across languages. These results expose a failure mode of continual pre-training in which targeted misinformation replaces internal factual representations without triggering broad performance collapse, motivating representation-level monitoring of factual integrity during model updates.
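The abstract describes tracking how a model's internal preference between a fact and its counterfactual evolves across layers. A common way to do this is a logit-lens probe: project each layer's residual stream through the unembedding matrix and compare the logits of the two candidate answers. The sketch below uses random NumPy tensors and toy dimensions in place of a real 3B model; the function name `logit_lens_preference` and all sizes are illustrative assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for a 3B-scale model (the paper's models have ~36 layers;
# d_model and vocab here are deliberately tiny).
n_layers, d_model, vocab = 36, 64, 1000
FACT_ID, CF_ID = 42, 777  # token ids of the factual vs counterfactual answer

W_U = rng.normal(size=(d_model, vocab))         # shared unembedding matrix
hiddens = rng.normal(size=(n_layers, d_model))  # residual stream per layer

def logit_lens_preference(hiddens, W_U, fact_id, cf_id):
    """Project each layer's residual stream through the unembedding and
    return the logit margin fact - counterfact (positive = fact preferred)."""
    logits = hiddens @ W_U                      # (n_layers, vocab)
    return logits[:, fact_id] - logits[:, cf_id]

margins = logit_lens_preference(hiddens, W_U, FACT_ID, CF_ID)
flip_layers = np.nonzero(margins < 0)[0]        # layers preferring counterfact
print(margins.shape)
```

On a real model, `hiddens` would come from a forward pass with hidden states recorded at every layer; tracking `flip_layers` across checkpoints is what surfaces the abrupt late-layer belief flips the paper reports.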


Key Contributions

  • Demonstrates that moderate continual pre-training poisoning (50–100% counterfactual ratio) flips over 55% of LLM factual responses without degrading alignment benchmarks or overall performance.
  • Localizes belief corruption to late transformer layers (e.g., layers 29–36 in 3B models) using logit lens and activation patching, showing partial reversibility up to 56.8%.
  • Shows corrupted beliefs generalize beyond poisoned prompts, selectively degrade commonsense reasoning, and transfer imperfectly across languages — exposing a subtle, undetected failure mode of continual pre-training.
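The second contribution relies on activation patching: re-running the poisoned model while splicing in the clean model's activations at selected layers, and checking whether the factual answer is recovered. A minimal sketch of the splicing step, using random arrays in place of real residual streams (the layer window 29–36 comes from the paper; everything else is a toy assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
n_layers, d_model = 36, 64

clean = rng.normal(size=(n_layers, d_model))     # clean-model residual stream
poisoned = rng.normal(size=(n_layers, d_model))  # poisoned-model residual stream

def patch_layers(poisoned, clean, layers):
    """Activation patching: overwrite the poisoned run's residual stream with
    the clean run's activations at `layers`, leaving other layers untouched."""
    patched = poisoned.copy()
    patched[layers] = clean[layers]
    return patched

late_window = list(range(29, 36))  # the late layers the paper localizes
patched = patch_layers(poisoned, clean, late_window)

n_patched = int((patched != poisoned).any(axis=1).sum())
print(n_patched)  # 7
```

In the actual experiment the patched activations are fed forward through the remaining layers; the fraction of belief flips undone this way is what yields the reported partial reversibility of up to 56.8%.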

🛡️ Threat Analysis

Data Poisoning Attack

The core contribution is demonstrating that injecting counterfactual misinformation into continual pre-training corpora poisons the model's factual knowledge — a training-data poisoning attack that overwrites specific factual representations without broad performance degradation.
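The attack is parameterized by a graded poisoning ratio over paired fact-counterfact items. A minimal sketch of how such a training mix could be constructed, with placeholder documents and an illustrative helper name (`mix_corpus` is not from the paper):

```python
import random

def mix_corpus(fact_docs, cf_docs, poison_ratio, seed=0):
    """Build a continual pre-training batch in which `poison_ratio` of the
    fact-bearing documents are replaced by their counterfactual pair."""
    rng = random.Random(seed)
    n_poison = round(poison_ratio * len(fact_docs))
    poisoned_idx = set(rng.sample(range(len(fact_docs)), n_poison))
    return [cf_docs[i] if i in poisoned_idx else fact_docs[i]
            for i in range(len(fact_docs))]

# Placeholder corpus: each fact document has a matched counterfactual.
facts = [f"fact-{i}" for i in range(100)]
cfs = [f"counterfact-{i}" for i in range(100)]

batch = mix_corpus(facts, cfs, poison_ratio=0.5)
print(sum(d.startswith("counterfact") for d in batch))  # 50
```

Sweeping `poison_ratio` from 0 to 1 reproduces the graded-exposure setup under which the paper observes response flips at moderate (50–100%) ratios.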


Details

Domains: nlp
Model Types: llm, transformer
Threat Tags: training_time, targeted, digital
Datasets: TruthfulQA, custom paired fact-counterfact dataset
Applications: large language models, continual pre-training pipelines, factual knowledge retrieval