Defense · 2025

Cyclic Ablation: Testing Concept Localization against Functional Regeneration in AI

Eduard Kapelko

0 citations · 9 references · arXiv


Published on arXiv: 2509.25220

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Deception exhibits full functional regeneration after every ablation cycle in DistilGPT-2, while perplexity rises monotonically across 10 cycles, suggesting deceptive behavior is distributed and entangled rather than localizable.
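Perplexity here is the standard language-modeling metric: the exponential of the mean per-token negative log-likelihood. A minimal sketch of the computation (the token probabilities below are made up for illustration, not drawn from the paper):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that assigns lower probability to each observed token is more
# "surprised", so its perplexity is higher — the monotone rise reported
# above means the edited model grows steadily worse at ordinary prediction.
confident = perplexity([0.5, 0.4, 0.6])
degraded = perplexity([0.2, 0.1, 0.3])
```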

Cyclic Ablation

Novel technique introduced


Safety and controllability are critical for large language models. A central question is whether undesirable behaviors like deception are localized functions that can be removed, or if they are deeply intertwined with a model's core cognitive abilities. We introduce "cyclic ablation," an iterative method to test this. By combining sparse autoencoders, targeted ablation, and adversarial training on DistilGPT-2, we attempted to eliminate the concept of deception. We found that, contrary to the localization hypothesis, deception was highly resilient. The model consistently recovered its deceptive behavior after each ablation cycle via adversarial training, a process we term functional regeneration. Crucially, every attempt at this "neurosurgery" caused a gradual but measurable decay in general linguistic performance, reflected by a consistent rise in perplexity. These findings are consistent with the view that complex concepts are distributed and entangled, underscoring the limitations of direct model editing through mechanistic interpretability.


Key Contributions

  • Cyclic ablation methodology: an iterative loop of SAE-based feature identification, targeted weight ablation, and adversarial fine-tuning to test concept localizability in LLMs
  • Empirical demonstration of 'functional regeneration' — DistilGPT-2 consistently recovers deceptive behavior across 10 ablation cycles via adversarial re-training, contradicting the localization hypothesis
  • Evidence that repeated ablation causes cumulative, measurable degradation of general linguistic competence (rising perplexity), underscoring the cost of iterative model editing
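The loop described in the contributions can be sketched as follows. This is a toy illustration of the control flow only: the SAE feature search, weight ablation, and adversarial fine-tuning are stubbed with placeholder logic, and names like `find_deception_features` are hypothetical, not the paper's API.

```python
# Toy sketch of the cyclic-ablation loop: identify "deception" features,
# ablate them, then adversarially fine-tune and check whether the
# behavior regenerates. The "model" is a dict of scalar weights.

def find_deception_features(model):
    # Stub for SAE-based identification: flag weights above a threshold.
    return [i for i, w in enumerate(model["weights"]) if w > 0.5]

def ablate(model, features):
    # Stub for targeted ablation: zero out the flagged weights.
    for i in features:
        model["weights"][i] = 0.0
    # Stand-in for the paper's observation that each edit degrades
    # general linguistic competence.
    model["perplexity"] += 0.3

def adversarial_finetune(model):
    # Stub for adversarial training: the behavior regenerates.
    model["weights"] = [max(w, 0.6) for w in model["weights"]]

def cyclic_ablation(model, cycles=10):
    history = []
    for _ in range(cycles):
        ablate(model, find_deception_features(model))
        deceptive_after_ablation = bool(find_deception_features(model))
        adversarial_finetune(model)
        deceptive_after_finetune = bool(find_deception_features(model))
        history.append((deceptive_after_ablation,
                        deceptive_after_finetune,
                        model["perplexity"]))
    return history

model = {"weights": [0.7, 0.1, 0.9, 0.4], "perplexity": 20.0}
history = cyclic_ablation(model, cycles=10)
```

Under these stubbed dynamics, every cycle shows the paper's qualitative pattern: deception is absent immediately after ablation, present again after fine-tuning, and perplexity rises monotonically.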

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, training_time
Datasets
D_truth (1000 GPT-4-generated truthful statements), D_deception (1000 GPT-4-generated deceptive statements)
Applications
llm safety alignment, model editing, mechanistic interpretability