
Subliminal Corruption: Mechanisms, Thresholds, and Interpretability

Reya Vir, Sarvesh Bhatnagar

2 citations · 17 references · arXiv


Published on arXiv · 2510.19152

Data Poisoning Attack (OWASP ML Top 10 — ML02)

Training Data Poisoning (OWASP LLM Top 10 — LLM03)

Key Finding

Alignment in GPT-2 fails via a sharp phase transition at a critical fraction of poisoned synthetic training data, and corruption spreads beyond the targeted trait to degrade overall model alignment.

Subliminal Corruption

Novel technique introduced


As machine learning models are increasingly fine-tuned on synthetic data, there is a critical risk of subtle misalignments spreading through interconnected AI systems. This paper investigates subliminal corruption, which we define as the transmission of undesirable traits through semantically neutral data, bypassing standard safety checks. While this phenomenon has been identified, a quantitative understanding of its dynamics is missing. To address this gap, we present a systematic study of the scaling laws, thresholds, and mechanisms of subliminal corruption using a teacher-student setup with GPT-2. Our experiments reveal three key findings: (1) subliminal corruption causes behavioral crossover, degrading the model's overall alignment, not just the targeted trait; (2) alignment fails in a sharp phase transition at a critical threshold of poisoned data, rather than degrading gradually; and (3) interpretability analysis shows the corruption mechanism mimics the model's natural fine-tuning process, making it difficult to detect. These results demonstrate a critical vulnerability in AI systems that rely on synthetic data and highlight the need for new safety protocols that can account for latent threats.
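The core experimental knob in a study like this is the fraction of poisoned samples mixed into the fine-tuning set. The paper does not publish its data-mixing code, so the helper below is a minimal, hypothetical sketch of that step: it replaces a controlled fraction of a clean dataset with poisoned teacher outputs.

```python
import random

def mix_poisoned_data(clean, poisoned, poison_fraction, seed=0):
    """Build a fine-tuning set in which a controlled fraction of the
    clean samples is replaced by poisoned (teacher-generated) samples.

    `clean` and `poisoned` are lists of training examples; the returned
    list has the same length as `clean`.
    """
    rng = random.Random(seed)
    n_poison = int(round(poison_fraction * len(clean)))
    mixed = list(clean)
    # Replace n_poison distinct positions with poisoned samples.
    for idx in rng.sample(range(len(mixed)), n_poison):
        mixed[idx] = rng.choice(poisoned)
    rng.shuffle(mixed)
    return mixed
```

Sweeping `poison_fraction` over a grid and fine-tuning a student at each point is how one would locate the critical threshold the abstract describes.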


Key Contributions

  • Demonstrates that subliminal corruption causes behavioral crossover — degrading overall model alignment beyond just the targeted trait
  • Shows alignment failure follows a sharp phase transition at a critical threshold of poisoned data rather than gradual degradation
  • Interpretability analysis reveals that the corruption mechanism is indistinguishable from natural fine-tuning, making it resistant to standard detection
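The phase-transition claim in the second contribution can be made operational with two simple statistics over a threshold sweep: the smallest poison fraction at which alignment falls below a failure level, and the largest single-step drop between adjacent fractions. The functions below are an illustrative sketch (the function names, failure level, and example scores are assumptions, not the paper's actual measurements).

```python
def find_critical_fraction(fractions, scores, failure_level=0.5):
    """Smallest poison fraction whose alignment score falls below failure_level."""
    for f, s in sorted(zip(fractions, scores)):
        if s < failure_level:
            return f
    return None

def transition_sharpness(fractions, scores):
    """Largest single-step drop in alignment between adjacent poison fractions.

    A value close to the total drop across the sweep suggests a sharp
    phase transition; a value near (total drop / number of steps)
    suggests gradual degradation instead.
    """
    pairs = sorted(zip(fractions, scores))
    return max(pairs[i][1] - pairs[i + 1][1] for i in range(len(pairs) - 1))
```

On a sharp curve such as `[0.95, 0.94, 0.93, 0.20, 0.18]`, almost the entire drop occurs in one step, which is the signature the paper reports.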

🛡️ Threat Analysis

Data Poisoning Attack

The paper's primary subject is corrupting training data — specifically synthetic data that appears semantically neutral but carries subtle misalignments — causing model behavior degradation. It studies scaling laws, poison thresholds, and mechanisms of this training-time data poisoning attack in a teacher-student setup.
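Why such poisoning bypasses standard safety checks is easy to illustrate: content filters inspect surface semantics, while subliminally poisoned samples are semantically neutral. The toy filter and samples below are entirely hypothetical, but they show the failure mode in miniature.

```python
# Hypothetical keyword blocklist standing in for a surface-level safety filter.
UNSAFE_KEYWORDS = {"exploit", "bypass", "attack"}

def keyword_filter(sample: str) -> bool:
    """Naive content check: passes a sample unless it contains a blocked keyword."""
    return not UNSAFE_KEYWORDS.intersection(sample.lower().split())

# A semantically neutral sample (e.g. a number sequence emitted by a
# misaligned teacher) carries no flagged surface content, so it passes,
# while overtly harmful text is caught.
neutral_poisoned = "142 857 285 714 428 571"
overt_harmful = "how to exploit the sandbox"
```

Because the corrupting signal lives in the statistical fingerprint of the teacher rather than in any filterable token, detection has to operate on training dynamics or model internals, not on the data's surface content.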


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time
Datasets
GPT-2 teacher-student synthetic data
Applications
llm fine-tuning on synthetic data, ai alignment systems