Reverse-Engineering Model Editing on Language Models

Zhiyu Sun 1,2, Minrui Luo 3,1, Yu Wang 4,1, Zhili Chen 2, Tianxing He 3,1,5

0 citations · 51 references · arXiv (Cornell University)

Published on arXiv

2602.10134

Model Inversion Attack

OWASP ML Top 10 — ML03

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

KSTER achieves a >99% subject recall rate and 88% semantic similarity when attacking Llama3-8B-Instruct on CounterFact, demonstrating that locate-then-edit methods inadvertently expose the very data they aim to protect.

KSTER

Novel technique introduced


Large language models (LLMs) are pretrained on corpora containing trillions of tokens and, therefore, inevitably memorize sensitive information. Locate-then-edit methods, as a mainstream paradigm of model editing, offer a promising solution by modifying model parameters without retraining. However, in this work, we reveal a critical vulnerability of this paradigm: the parameter updates inadvertently serve as a side channel, enabling attackers to recover the edited data. We propose a two-stage reverse-engineering attack named KSTER (KeySpace ReconsTruction-then-Entropy Reduction) that leverages the low-rank structure of these updates. First, we theoretically show that the row space of the update matrix encodes a "fingerprint" of the edited subjects, enabling accurate subject recovery via spectral analysis. Second, we introduce an entropy-based prompt recovery attack that reconstructs the semantic context of the edit. Extensive experiments on multiple LLMs demonstrate that our attacks can recover edited data with high success rates. Furthermore, we propose subspace camouflage, a defense strategy that obfuscates the update fingerprint with semantic decoys. This approach effectively mitigates reconstruction risks without compromising editing utility. Our code is available at https://github.com/reanatom/EditingAtk.git.
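The first stage of the attack described above rests on a simple algebraic fact: a locate-then-edit update is a sum of outer products between value deltas and subject key vectors, so its row space spans exactly those keys. The toy sketch below illustrates this with random stand-ins for keys and value deltas (all names and dimensions here are hypothetical, not taken from the paper's implementation); an SVD of the weight delta recovers a subspace that contains every edited key.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_edits = 64, 32, 3

# Hypothetical subject key vectors (stand-ins for the hidden states
# that a ROME/MEMIT-style editor would target).
keys = rng.standard_normal((n_edits, d_in))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)

# A locate-then-edit-style low-rank update:
# dW = sum over edits of (value delta) outer (subject key).
deltas = rng.standard_normal((n_edits, d_out))
dW = sum(np.outer(deltas[i], keys[i]) for i in range(n_edits))  # (d_out, d_in)

# Spectral analysis: the top right singular vectors of dW form an
# orthonormal basis of its row space -- the "fingerprint" of the edit.
_, s, Vt = np.linalg.svd(dW)
row_space = Vt[:n_edits]

# Each true subject key lies (almost) entirely inside that subspace:
# projecting a unit-norm key onto it preserves essentially all its norm.
for k in keys:
    proj = row_space.T @ (row_space @ k)
    print(round(float(np.linalg.norm(proj)), 4))  # ~1.0 for every key
```

In a real attack the recovered subspace would then be matched against candidate token embeddings to name the edited subjects; this sketch only demonstrates the subspace-recovery step.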


Key Contributions

  • Theoretical proof that the row space of a locate-then-edit parameter update matrix encodes a unique spectral fingerprint of the edited subject, enabling recovery via SVD
  • Two-stage KSTER attack: KeySpaceReconstruction (subject recovery via spectral analysis) followed by EntropyReduction (semantic prompt reconstruction)
  • Subspace camouflage defense that injects semantic decoys into the update subspace to obfuscate the algebraic fingerprint without degrading editing performance
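The defense idea in the last bullet can be illustrated in the same toy setting (again with hypothetical names and dimensions, not the paper's actual procedure): mixing a decoy direction into the update enlarges its row space, so an attacker's SVD no longer isolates the true subject key.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out = 64, 32

# A real edited-subject key and a semantic decoy key (random stand-ins).
real_key = rng.standard_normal(d_in)
real_key /= np.linalg.norm(real_key)
decoy_key = rng.standard_normal(d_in)
decoy_key /= np.linalg.norm(decoy_key)

# Undefended edit: a rank-1 update whose row space IS the real key.
dW = np.outer(rng.standard_normal(d_out), real_key)

# Camouflaged edit (toy version): add a decoy outer product so the
# row space mixes the real key with the decoy direction.
dW_camo = dW + np.outer(rng.standard_normal(d_out), decoy_key)

# The attacker's spectral fingerprint grows from 1-D to 2-D, and the
# singular vectors no longer align with the real key alone.
print(np.linalg.matrix_rank(dW), np.linalg.matrix_rank(dW_camo))  # 1 2
```

The paper's actual defense additionally constrains the decoys so that editing utility is preserved; this sketch only shows why extra subspace directions blur the fingerprint.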

🛡️ Threat Analysis

Model Inversion Attack

KSTER reconstructs private training/edited data (subjects and semantic context) from the weight delta matrix of locate-then-edit model edits — directly analogous to gradient inversion attacks, where an adversary recovers data from parameter differences rather than model outputs.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time
Datasets
CounterFact, ZsRE
Applications
llm model editing, privacy protection via model editing, knowledge editing