Reverse-Engineering Model Editing on Language Models
Zhiyu Sun 1,2, Minrui Luo 3,1, Yu Wang 4,1, Zhili Chen 2, Tianxing He 3,1,5
Published on arXiv
2602.10134
Model Inversion Attack
OWASP ML Top 10 — ML03
Sensitive Information Disclosure
OWASP LLM Top 10 — LLM06
Key Finding
KSTER achieves a >99% subject recall rate and 88% semantic similarity when attacking Llama3-8B-Instruct on CounterFact, demonstrating that locate-then-edit methods inadvertently expose the very data they aim to protect.
KSTER
Novel technique introduced
Large language models (LLMs) are pretrained on corpora containing trillions of tokens and therefore inevitably memorize sensitive information. Locate-then-edit methods, a mainstream paradigm of model editing, offer a promising remedy by modifying model parameters without retraining. However, in this work we reveal a critical vulnerability of this paradigm: the parameter updates inadvertently serve as a side channel, enabling attackers to recover the edited data. We propose a two-stage reverse-engineering attack named *KSTER* (**K**ey**S**pace Recons**t**ruction-then-**E**ntropy **R**eduction) that leverages the low-rank structure of these updates. First, we theoretically show that the row space of the update matrix encodes a "fingerprint" of the edited subjects, enabling accurate subject recovery via spectral analysis. Second, we introduce an entropy-based prompt-recovery attack that reconstructs the semantic context of the edit. Extensive experiments on multiple LLMs demonstrate that our attacks recover edited data with high success rates. Finally, we propose *subspace camouflage*, a defense that obfuscates the update fingerprint with semantic decoys, mitigating reconstruction risk without compromising editing utility. Our code is available at https://github.com/reanatom/EditingAtk.git.
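The core observation — that a low-rank edit update leaks the edited subject's key direction through its row space — can be illustrated with a toy numpy sketch. This is not the paper's implementation: the random "key" vectors, subject names, and the rank-1 update shape (value ⊗ key, as in ROME-style edits) are simplifying assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 64, 32

# Hypothetical subject "key" vectors (stand-ins for the MLP activations
# a locate-then-edit method associates with each candidate subject).
candidates = {name: rng.standard_normal(d_in) for name in ["Paris", "Tokyo", "Cairo"]}

# Simulate a rank-1 edit update delta_W = value @ key^T for subject "Tokyo".
key = candidates["Tokyo"]
value = rng.standard_normal(d_out)
delta_W = np.outer(value, key)  # shape (d_out, d_in)

# Attack sketch: the row space of delta_W is spanned by the edited key,
# so the top right singular vector is a "fingerprint" of the subject.
_, _, Vt = np.linalg.svd(delta_W)
fingerprint = Vt[0]  # unit-norm

# Match the fingerprint against candidate keys by absolute cosine similarity.
scores = {name: abs(fingerprint @ k) / np.linalg.norm(k)
          for name, k in candidates.items()}
recovered = max(scores, key=scores.get)
print(recovered)  # → Tokyo
```

Because `delta_W` is exactly rank 1 here, the top right singular vector equals the edited key up to sign, so the cosine score for the true subject is essentially 1 while unrelated keys score near zero.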
Key Contributions
- Theoretical proof that the row space of a locate-then-edit parameter update matrix encodes a unique spectral fingerprint of the edited subject, enabling recovery via SVD
- Two-stage KSTER attack: KeySpace Reconstruction (subject recovery via spectral analysis) followed by Entropy Reduction (semantic prompt reconstruction)
- Subspace camouflage defense that injects semantic decoys into the update subspace to obfuscate the algebraic fingerprint without degrading editing performance
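The intuition behind subspace camouflage can also be sketched in a few lines: if decoy rank-1 terms of comparable magnitude are mixed into the update, the spectrum of the weight delta no longer singles out one direction, so the attacker's SVD fingerprint becomes ambiguous. This toy ignores how the paper's semantic decoys preserve editing utility; matching decoy magnitudes to the real edit is an assumption made here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out = 64, 32

# Rank-1 edit update for one true subject key (toy stand-in).
true_key = rng.standard_normal(d_in)
delta_W = np.outer(rng.standard_normal(d_out), true_key)

# Camouflage sketch: add decoy rank-1 terms scaled to the real edit's
# Frobenius norm, so no single direction dominates the spectrum.
camouflaged = delta_W.copy()
scale = np.linalg.norm(delta_W)
for _ in range(3):
    decoy = np.outer(rng.standard_normal(d_out), rng.standard_normal(d_in))
    camouflaged += decoy * (scale / np.linalg.norm(decoy))

# Plain update: one dominant singular value, a clean fingerprint.
# Camouflaged update: several comparable singular values, no unique fingerprint.
s_plain = np.linalg.svd(delta_W, compute_uv=False)
s_cam = np.linalg.svd(camouflaged, compute_uv=False)
n_plain = int((s_plain > 1e-8 * s_plain[0]).sum())
n_cam = int((s_cam > 0.1 * s_cam[0]).sum())
print(n_plain, n_cam)  # 1 significant direction vs. several
```

In high dimension the random decoy keys are near-orthogonal to the true key, so the camouflaged delta has several singular values of similar size and the top right singular vector no longer identifies the edited subject.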
🛡️ Threat Analysis
KSTER reconstructs private training/edited data (subjects and semantic context) from the weight delta matrix of locate-then-edit model edits — directly analogous to gradient inversion attacks, where an adversary recovers data from parameter differences rather than model outputs.