Defense · 2025

Cyclic Ablation: Testing Concept Localization against Functional Regeneration in AI

Eduard Kapelko

0 citations · 9 references · arXiv


Published on arXiv: 2509.25220

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Deception exhibits full functional regeneration after every ablation cycle in DistilGPT-2, while perplexity rises monotonically across 10 cycles, suggesting deceptive behavior is distributed and entangled rather than localizable.
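Perplexity here is the standard language-modeling metric: the exponential of the mean per-token negative log-likelihood. A minimal sketch of the computation (the token probabilities below are made up for illustration, not drawn from the paper):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that assigns lower probability to each observed token is more
# "surprised", so its perplexity is higher — the monotone rise reported
# above means the edited model grows steadily worse at ordinary prediction.
confident = perplexity([0.5, 0.4, 0.6])
degraded = perplexity([0.2, 0.1, 0.3])
```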

Cyclic Ablation

Novel technique introduced


Safety and controllability are critical for large language models. A central question is whether undesirable behaviors like deception are localized functions that can be removed, or if they are deeply intertwined with a model's core cognitive abilities. We introduce "cyclic ablation," an iterative method to test this. By combining sparse autoencoders, targeted ablation, and adversarial training on DistilGPT-2, we attempted to eliminate the concept of deception. We found that, contrary to the localization hypothesis, deception was highly resilient. The model consistently recovered its deceptive behavior after each ablation cycle via adversarial training, a process we term functional regeneration. Crucially, every attempt at this "neurosurgery" caused a gradual but measurable decay in general linguistic performance, reflected by a consistent rise in perplexity. These findings are consistent with the view that complex concepts are distributed and entangled, underscoring the limitations of direct model editing through mechanistic interpretability.


Key Contributions

  • Cyclic ablation methodology: an iterative loop of SAE-based feature identification, targeted weight ablation, and adversarial fine-tuning to test concept localizability in LLMs
  • Empirical demonstration of 'functional regeneration' — DistilGPT-2 consistently recovers deceptive behavior across 10 ablation cycles via adversarial re-training, contradicting the localization hypothesis
  • Evidence that repeated ablation causes cumulative, measurable degradation of general linguistic competence (rising perplexity), underscoring the cost of iterative model editing
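The loop described in the contributions can be sketched as follows. This is a toy illustration of the control flow only: the SAE feature search, weight ablation, and adversarial fine-tuning are stubbed with placeholder logic, and names like `find_deception_features` are hypothetical, not the paper's API.

```python
# Toy sketch of the cyclic-ablation loop: identify "deception" features,
# ablate them, then adversarially fine-tune and check whether the
# behavior regenerates. The "model" is a dict of scalar weights.

def find_deception_features(model):
    # Stub for SAE-based identification: flag weights above a threshold.
    return [i for i, w in enumerate(model["weights"]) if w > 0.5]

def ablate(model, features):
    # Stub for targeted ablation: zero out the flagged weights.
    for i in features:
        model["weights"][i] = 0.0
    # Stand-in for the paper's observation that each edit degrades
    # general linguistic competence.
    model["perplexity"] += 0.3

def adversarial_finetune(model):
    # Stub for adversarial training: the behavior regenerates.
    model["weights"] = [max(w, 0.6) for w in model["weights"]]

def cyclic_ablation(model, cycles=10):
    history = []
    for _ in range(cycles):
        ablate(model, find_deception_features(model))
        deceptive_after_ablation = bool(find_deception_features(model))
        adversarial_finetune(model)
        deceptive_after_finetune = bool(find_deception_features(model))
        history.append((deceptive_after_ablation,
                        deceptive_after_finetune,
                        model["perplexity"]))
    return history

model = {"weights": [0.7, 0.1, 0.9, 0.4], "perplexity": 20.0}
history = cyclic_ablation(model, cycles=10)
```

Under these stubbed dynamics, every cycle shows the paper's qualitative pattern: deception is absent immediately after ablation, present again after fine-tuning, and perplexity rises monotonically.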

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, training_time
Datasets
D_truth (1000 GPT-4-generated truthful statements), D_deception (1000 GPT-4-generated deceptive statements)
Applications
llm safety alignment, model editing, mechanistic interpretability