Extracting alignment data in open models
Federico Barbero 1, Xiangming Gu 2, Christopher A. Choquette-Choo 3, Chawin Sitawarin 3, Matthew Jagielski 4, Itay Yona 5, Petar Veličković 3, Ilia Shumailov 6, Jamie Hayes 3
Published on arXiv: 2510.18554
Model Inversion Attack
OWASP ML Top 10 — ML03
Sensitive Information Disclosure
OWASP LLM Top 10 — LLM06
Key Finding
Embedding-based similarity reveals at least 10x more memorized alignment training data than string matching, and the extracted data can be used to train a base model that recovers meaningful downstream performance.
Chat Template Extraction with Embedding Similarity
Novel technique introduced
In this work, we show that it is possible to extract significant amounts of alignment training data from a post-trained model -- data used to steer the model to improve capabilities such as long-context reasoning, safety, instruction following, and maths. While the majority of related work on memorisation has focused on measuring the success of training data extraction through string matching, we argue that embedding models are better suited for our specific goals. Distances measured through a high-quality embedding model can identify semantic similarities between strings that a metric such as edit distance will struggle to capture. In fact, in our investigation, approximate string matching would have severely undercounted (by a conservative estimate of $10\times$) the amount of data that can be extracted, due to trivial artifacts that deflate the metric. Interestingly, we find that models readily regurgitate training data that was used in post-training phases such as SFT or RL. We show that this data can then be used to train a base model, recovering a meaningful amount of the original performance. We believe our work exposes a possibly overlooked risk of alignment data extraction. Finally, our work opens up an interesting discussion on the downstream effects of distillation practices: since models seem to be regurgitating aspects of their training set, distillation can therefore be thought of as indirectly training on the model's original dataset.
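The chat-template prompting idea can be sketched as follows: the prompt is cut off right where a user's turn would begin, so an aligned model "completes" it with a query that resembles its own alignment training distribution. This is a minimal illustration, assuming ChatML-style template tokens (`<|im_start|>`, `<|im_end|>`); real templates vary by model, and the paper generalizes the Magpie strategy rather than using this exact helper.

```python
def make_extraction_prompt(system: str = "") -> str:
    """Build a prompt that ends exactly where a user message would start.

    A post-trained model completing this prompt tends to generate a
    synthetic user query drawn from its alignment (SFT/RL) training
    distribution. Template tokens here are illustrative assumptions.
    """
    parts = []
    if system:
        parts.append(f"<|im_start|>system\n{system}<|im_end|>\n")
    # Stop immediately after the user header: the model fills in the turn.
    parts.append("<|im_start|>user\n")
    return "".join(parts)


prompt = make_extraction_prompt()
# model.generate(prompt) would emit a synthetic user query; generating
# again with that query appended yields a synthetic response, giving one
# extracted (query, response) pair per sample.
```

Sampling this prompt many times with temperature yields a corpus of (query, response) pairs that can then be scored for similarity against known alignment datasets.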
Key Contributions
- Generalizes the Magpie chat-template prompting strategy to extract post-training alignment data (SFT/RL) from open-weight LLMs, confirming that models regurgitate their alignment training data.
- Shows that approximate string matching underestimates memorization by at least 10x compared to embedding-based semantic similarity, making embedding models the superior metric for this type of extraction.
- Demonstrates that extracted alignment data can be used to post-train a base model, recovering a meaningful portion of the original model's performance — exposing a distillation risk.
🛡️ Threat Analysis
The paper demonstrates an adversarial method for reconstructing alignment training data (SFT and RL traces) from open-weight LLMs by prompting with bare chat templates and scoring the generations for semantic similarity against known alignment datasets, a training data extraction attack that fits the ML03 (Model Inversion) adversary model.