Can Fine-Tuning Erase Your Edits? On the Fragile Coexistence of Knowledge Editing and Adaptation
Yinjie Cheng 1, Paul Youssef 2, Christin Seifert 2, Jörg Schlötterer 2, Zhixue Zhao 1
Published on arXiv: 2511.05852
Transfer Learning Attack (OWASP ML Top 10 — ML07)
Model Poisoning (OWASP ML Top 10 — ML10)
Key Finding
Fine-tuning generally degrades knowledge edits, but malicious edits can persist and transfer to fine-tuned models, posing a safety risk. Under fine-tuning, AlphaEdit edits decay more than MEMIT edits.
Knowledge editing (KE) has emerged as a lightweight alternative to retraining for correcting or injecting specific facts in large language models (LLMs). Meanwhile, fine-tuning remains the default operation for adapting LLMs to new domains and tasks. Despite their widespread adoption, these two post-training interventions have been studied in isolation, leaving open a crucial question: if we fine-tune an edited model, do the edits survive? This question is motivated by two practical scenarios: removing covert or malicious edits, and preserving beneficial edits. If fine-tuning impairs edits (Fig. 1), current KE methods become less useful, since every fine-tuned model would require re-editing, significantly increasing cost; if edits persist, fine-tuned models risk propagating hidden malicious edits, raising serious safety concerns. To this end, we systematically quantify edit decay after fine-tuning, investigating how fine-tuning affects knowledge editing. Our results show that edits decay after fine-tuning, with survival varying across configurations; for example, AlphaEdit edits decay more than MEMIT edits. Further, we find that fine-tuning only the edited layers can effectively remove edits, though at a slight cost to downstream performance. Surprisingly, fine-tuning non-edited layers impairs more edits than full fine-tuning. Overall, our study establishes empirical baselines and actionable strategies for integrating knowledge editing with fine-tuning, and underscores that evaluating model editing requires considering the full LLM application pipeline.
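Edit decay can be quantified by re-querying each edited fact before and after fine-tuning and comparing how many edits still elicit the injected target. The sketch below is a minimal, model-agnostic illustration of that measurement (not the paper's exact evaluation code); `predict` is a hypothetical callable standing in for greedy decoding from the model under test.

```python
def edit_survival_rate(predict, edits):
    """Fraction of injected edits the model still reproduces.

    predict: callable mapping a prompt string to the model's answer string
             (e.g. a wrapper around greedy decoding).
    edits:   list of (prompt, target) pairs injected by a KE method
             such as MEMIT or AlphaEdit.
    """
    if not edits:
        return 0.0
    hits = sum(
        1 for prompt, target in edits
        if target.lower() in predict(prompt).lower()
    )
    return hits / len(edits)


# Decay is then the drop in survival across the fine-tuning step:
# decay = edit_survival_rate(predict_before_ft, edits) \
#         - edit_survival_rate(predict_after_ft, edits)
```

Comparing this rate across KE methods, model sizes, and fine-tuning strategies is what yields findings such as AlphaEdit edits decaying faster than MEMIT edits.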
Key Contributions
- First systematic empirical study of knowledge edit survival under fine-tuning across 5 LLMs, 2 KE methods (MEMIT, AlphaEdit), and 3 fine-tuning approaches (full, LoRA, DoRA) yielding 232 configurations
- Finds edits generally decay after fine-tuning but malicious edits can persist and propagate into fine-tuned models, with edit survival varying by method and model size
- Shows that fine-tuning only the edited layers effectively removes edits, while fine-tuning non-edited layers impairs more edits than full fine-tuning
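The layer-targeted fine-tuning in the last contribution amounts to freezing all parameters except those in a chosen set of transformer blocks. A minimal sketch of that freezing logic is below; it assumes a PyTorch-style model whose block parameters are named `layers.<i>.…` (the actual prefix varies by architecture, e.g. `model.layers` or `transformer.h`), and the function name is illustrative, not from the paper.

```python
def set_trainable_layers(model, edited_layers, train_edited_only=True):
    """Restrict which transformer blocks fine-tuning may update.

    model:             object exposing named_parameters() -> (name, param)
                       pairs, where each param has a .requires_grad flag
                       (PyTorch nn.Module semantics).
    edited_layers:     indices of the blocks modified by the KE method.
    train_edited_only: True  -> train only edited blocks (scrub edits);
                       False -> train only non-edited blocks.
    """
    prefixes = tuple(f"layers.{i}." for i in edited_layers)
    for name, param in model.named_parameters():
        in_edited = any(p in name for p in prefixes)
        param.requires_grad = in_edited if train_edited_only else not in_edited
```

With `train_edited_only=True` this implements the edit-removal strategy; flipping the flag reproduces the non-edited-layer setting that, surprisingly, impairs more edits than full fine-tuning.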
🛡️ Threat Analysis
The core investigation is whether edits — including malicious ones injected via knowledge editing — survive subsequent fine-tuning and adapter-based adaptation, directly addressing ML07's concern that hidden behaviors persist through the fine-tuning process.
The paper explicitly models malicious knowledge edits (BadEdit-style backdoors, misinformation injection, biasing) as a form of model poisoning, evaluating whether such hidden targeted behaviors are retained or neutralized after fine-tuning across five LLMs.