Reverse-Engineering Model Editing on Language Models
Zhiyu Sun 1,2, Minrui Luo 3,1, Yu Wang 4,1, Zhili Chen 2, Tianxing He 3,1,5
Published on arXiv
2602.10134
Model Inversion Attack
OWASP ML Top 10 — ML03
Sensitive Information Disclosure
OWASP LLM Top 10 — LLM06
Key Finding
KSTER achieves a >99% subject recall rate and 88% semantic similarity when attacking Llama3-8B-Instruct on CounterFact, demonstrating that locate-then-edit methods inadvertently expose the very data they aim to protect.
KSTER
Novel technique introduced
Large language models (LLMs) are pretrained on corpora containing trillions of tokens and therefore inevitably memorize sensitive information. Locate-then-edit methods, a mainstream paradigm of model editing, offer a promising remedy by modifying model parameters without retraining. However, in this work we reveal a critical vulnerability of this paradigm: the parameter updates inadvertently serve as a side channel, enabling attackers to recover the edited data. We propose a two-stage reverse-engineering attack named *KSTER* (**K**ey**S**pace Recons**t**ruction-then-**E**ntropy **R**eduction) that leverages the low-rank structure of these updates. First, we theoretically show that the row space of the update matrix encodes a "fingerprint" of the edited subjects, enabling accurate subject recovery via spectral analysis. Second, we introduce an entropy-based prompt-recovery attack that reconstructs the semantic context of the edit. Extensive experiments on multiple LLMs demonstrate that our attacks recover edited data with high success rates. Finally, we propose *subspace camouflage*, a defense that obfuscates the update fingerprint with semantic decoys, mitigating reconstruction risk without compromising editing utility. Our code is available at https://github.com/reanatom/EditingAtk.git.
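The core observation — that a low-rank edit update leaks the edited subject's key direction through its row space — can be illustrated with a toy numpy sketch. This is not the paper's implementation: the random "key" vectors, subject names, and the rank-1 update shape (value ⊗ key, as in ROME-style edits) are simplifying assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 64, 32

# Hypothetical subject "key" vectors (stand-ins for the MLP activations
# a locate-then-edit method associates with each candidate subject).
candidates = {name: rng.standard_normal(d_in) for name in ["Paris", "Tokyo", "Cairo"]}

# Simulate a rank-1 edit update delta_W = value @ key^T for subject "Tokyo".
key = candidates["Tokyo"]
value = rng.standard_normal(d_out)
delta_W = np.outer(value, key)  # shape (d_out, d_in)

# Attack sketch: the row space of delta_W is spanned by the edited key,
# so the top right singular vector is a "fingerprint" of the subject.
_, _, Vt = np.linalg.svd(delta_W)
fingerprint = Vt[0]  # unit-norm

# Match the fingerprint against candidate keys by absolute cosine similarity.
scores = {name: abs(fingerprint @ k) / np.linalg.norm(k)
          for name, k in candidates.items()}
recovered = max(scores, key=scores.get)
print(recovered)  # → Tokyo
```

Because `delta_W` is exactly rank 1 here, the top right singular vector equals the edited key up to sign, so the cosine score for the true subject is essentially 1 while unrelated keys score near zero.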
Key Contributions
- Theoretical proof that the row space of a locate-then-edit parameter update matrix encodes a unique spectral fingerprint of the edited subject, enabling recovery via SVD
- Two-stage KSTER attack: KeySpace Reconstruction (subject recovery via spectral analysis) followed by Entropy Reduction (semantic prompt reconstruction)
- Subspace camouflage defense that injects semantic decoys into the update subspace to obfuscate the algebraic fingerprint without degrading editing performance
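The intuition behind subspace camouflage can also be sketched in a few lines: if decoy rank-1 terms of comparable magnitude are mixed into the update, the spectrum of the weight delta no longer singles out one direction, so the attacker's SVD fingerprint becomes ambiguous. This toy ignores how the paper's semantic decoys preserve editing utility; matching decoy magnitudes to the real edit is an assumption made here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out = 64, 32

# Rank-1 edit update for one true subject key (toy stand-in).
true_key = rng.standard_normal(d_in)
delta_W = np.outer(rng.standard_normal(d_out), true_key)

# Camouflage sketch: add decoy rank-1 terms scaled to the real edit's
# Frobenius norm, so no single direction dominates the spectrum.
camouflaged = delta_W.copy()
scale = np.linalg.norm(delta_W)
for _ in range(3):
    decoy = np.outer(rng.standard_normal(d_out), rng.standard_normal(d_in))
    camouflaged += decoy * (scale / np.linalg.norm(decoy))

# Plain update: one dominant singular value, a clean fingerprint.
# Camouflaged update: several comparable singular values, no unique fingerprint.
s_plain = np.linalg.svd(delta_W, compute_uv=False)
s_cam = np.linalg.svd(camouflaged, compute_uv=False)
n_plain = int((s_plain > 1e-8 * s_plain[0]).sum())
n_cam = int((s_cam > 0.1 * s_cam[0]).sum())
print(n_plain, n_cam)  # 1 significant direction vs. several
```

In high dimension the random decoy keys are near-orthogonal to the true key, so the camouflaged delta has several singular values of similar size and the top right singular vector no longer identifies the edited subject.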
🛡️ Threat Analysis
KSTER reconstructs private training/edited data (subjects and semantic context) from the weight delta matrix of locate-then-edit model edits — directly analogous to gradient inversion attacks, where an adversary recovers data from parameter differences rather than model outputs.