
Published on arXiv

2510.18554

Model Inversion Attack

OWASP ML Top 10 — ML03

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

Embedding-based similarity reveals at least 10x more memorized alignment training data than string matching, and the extracted data can be used to train a base model that recovers meaningful downstream performance.

Chat Template Extraction with Embedding Similarity

Novel technique introduced


In this work, we show that it is possible to extract significant amounts of alignment training data from a post-trained model: data used to steer the model toward capabilities such as long-context reasoning, safety, instruction following, and maths. While the majority of related work on memorisation has measured extraction success through string matching, we argue that embedding models are better suited for our specific goals. Distances measured through a high-quality embedding model can identify semantic similarities between strings that a metric such as edit distance will struggle to capture. In fact, in our investigation, approximate string matching would have severely undercounted (by a conservative estimate of 10x) the amount of extractable data, due to trivial artifacts that deflate the metric. Interestingly, we find that models readily regurgitate training data used in post-training phases such as SFT or RL. We show that this data can then be used to train a base model, recovering a meaningful amount of the original performance. We believe our work exposes a possibly overlooked risk of alignment data extraction. Finally, our work opens an interesting discussion on the downstream effects of distillation practices: since models appear to regurgitate aspects of their training set, distillation can be thought of as indirectly training on the teacher model's original dataset.
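The chat-template prompting idea can be sketched as follows. This is a minimal, hypothetical illustration using Llama-3-style special tokens; in practice the prefix would come from the target model's own tokenizer (e.g. `tokenizer.apply_chat_template` in Hugging Face `transformers`), and the completion step requires an actual model.

```python
# Magpie-style extraction sketch: build a chat-template prefix that ends
# exactly where the *user's* text would begin. An aligned model asked to
# complete from this prefix tends to generate a plausible user query,
# often one resembling its SFT/RL training data.
# NOTE: the template tokens below are illustrative (Llama-3-style); use
# the target model's real chat template in practice.

def extraction_prefix(system_prompt: str = "") -> str:
    prefix = "<|begin_of_text|>"
    if system_prompt:
        prefix += ("<|start_header_id|>system<|end_header_id|>\n\n"
                   f"{system_prompt}<|eot_id|>")
    # Open the user turn but leave its content empty -- the model fills it in.
    prefix += "<|start_header_id|>user<|end_header_id|>\n\n"
    return prefix

prompt = extraction_prefix()
# model.generate(prompt) would then sample a synthetic "user turn";
# sampling many such completions yields candidate extracted alignment data.
```

Repeating the generation with different sampling seeds produces a large pool of candidate (query, response) pairs, which the paper then compares against known alignment datasets via embedding similarity.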


Key Contributions

  • Generalizes the Magpie chat-template prompting strategy to extract post-training alignment data (SFT/RL) from open-weight LLMs, confirming that models regurgitate their alignment training data.
  • Shows that approximate string matching underestimates memorization by at least 10x compared to embedding-based semantic similarity, making embedding models the superior metric for this type of extraction.
  • Demonstrates that extracted alignment data can be used to post-train a base model, recovering a meaningful portion of the original model's performance — exposing a distillation risk.
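The gap between string matching and semantic similarity can be illustrated with a toy example. The snippet below uses `difflib` for approximate string matching and a bag-of-words cosine as a crude stand-in for the high-quality neural embedding model the paper actually uses; the example strings are invented, not from the paper's data.

```python
import difflib
import math
from collections import Counter

def string_similarity(a: str, b: str) -> float:
    # Approximate string matching (edit-distance-like ratio) -- the kind
    # of metric the paper argues undercounts memorization.
    return difflib.SequenceMatcher(None, a, b).ratio()

def cosine_bow(a: str, b: str) -> float:
    # Toy bag-of-words cosine similarity. A real pipeline would embed
    # both strings with a neural embedding model and take the cosine.
    tokens = lambda s: [w.strip(".,!?") for w in s.lower().split()]
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

# A regurgitated sample often differs from the original only by trivial
# artifacts (reordering, paraphrase, formatting) that crush edit-distance
# scores but barely move a semantic similarity score.
original = "Compute the derivative of f(x) = x^2 + 3x and explain each step."
extracted = "Explain each step and compute the derivative of f(x) = x^2 + 3x."

s = string_similarity(original, extracted)
c = cosine_bow(original, extracted)
# Here c is far higher than s, even though the two strings carry the
# same content -- the undercounting effect the paper describes.
```

A string-matching threshold tuned for near-verbatim copies would miss this pair entirely, while any reasonable semantic metric flags it as a match.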

🛡️ Threat Analysis

Model Inversion Attack

The paper demonstrates an adversarial method to reconstruct alignment training data (SFT examples, RL traces) from open-weight LLMs by prompting with chat templates and measuring semantic similarity against candidate datasets, a training data extraction attack that matches the ML03 (Model Inversion) threat pattern.


Details

Domains: nlp
Model Types: llm, transformer
Threat Tags: black_box, inference_time
Datasets: SFT datasets, RL training datasets, math/reasoning alignment corpora
Applications: LLM alignment data protection, model distillation security, post-training data confidentiality