Extracting alignment data in open models
Federico Barbero 1, Xiangming Gu 2, Christopher A. Choquette-Choo 3, Chawin Sitawarin 3, Matthew Jagielski 4, Itay Yona 5, Petar Veličković 3, Ilia Shumailov 6, Jamie Hayes 3
Published on arXiv: 2510.18554
Model Inversion Attack
OWASP ML Top 10 — ML03
Sensitive Information Disclosure
OWASP LLM Top 10 — LLM06
Key Finding
Embedding-based similarity reveals at least 10x more memorized alignment training data than string matching, and the extracted data can be used to train a base model that recovers meaningful downstream performance.
Chat Template Extraction with Embedding Similarity
Novel technique introduced
In this work, we show that it is possible to extract significant amounts of alignment training data from a post-trained model -- data used to steer the model to improve capabilities such as long-context reasoning, safety, instruction following, and maths. While the majority of related work on memorisation has focused on measuring the success of training data extraction through string matching, we argue that embedding models are better suited for our specific goals. Distances measured through a high-quality embedding model can identify semantic similarities between strings that a metric such as edit distance will struggle to capture. In fact, in our investigation, approximate string matching would have severely undercounted (by a conservative estimate of $10\times$) the amount of data that can be extracted, due to trivial artifacts that deflate the metric. Interestingly, we find that models readily regurgitate training data that was used in post-training phases such as SFT or RL. We show that this data can then be used to train a base model, recovering a meaningful amount of the original performance. We believe our work exposes a possibly overlooked risk of alignment data extraction. Finally, our work opens up an interesting discussion on the downstream effects of distillation practices: since models seem to be regurgitating aspects of their training set, distillation can therefore be thought of as indirectly training on the model's original dataset.
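The chat-template prompting idea can be sketched as follows: the prompt is cut off right where a user's turn would begin, so an aligned model "completes" it with a query that resembles its own alignment training distribution. This is a minimal illustration, assuming ChatML-style template tokens (`<|im_start|>`, `<|im_end|>`); real templates vary by model, and the paper generalizes the Magpie strategy rather than using this exact helper.

```python
def make_extraction_prompt(system: str = "") -> str:
    """Build a prompt that ends exactly where a user message would start.

    A post-trained model completing this prompt tends to generate a
    synthetic user query drawn from its alignment (SFT/RL) training
    distribution. Template tokens here are illustrative assumptions.
    """
    parts = []
    if system:
        parts.append(f"<|im_start|>system\n{system}<|im_end|>\n")
    # Stop immediately after the user header: the model fills in the turn.
    parts.append("<|im_start|>user\n")
    return "".join(parts)


prompt = make_extraction_prompt()
# model.generate(prompt) would emit a synthetic user query; generating
# again with that query appended yields a synthetic response, giving one
# extracted (query, response) pair per sample.
```

Sampling this prompt many times with temperature yields a corpus of (query, response) pairs that can then be scored for similarity against known alignment datasets.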
Key Contributions
- Generalizes the Magpie chat-template prompting strategy to extract post-training alignment data (SFT/RL) from open-weight LLMs, confirming that models regurgitate their alignment training data.
- Shows that approximate string matching underestimates memorization by at least 10x compared to embedding-based semantic similarity, making embedding models the superior metric for this type of extraction.
- Demonstrates that extracted alignment data can be used to post-train a base model, recovering a meaningful portion of the original model's performance — exposing a distillation risk.
🛡️ Threat Analysis
The paper demonstrates an adversarial method for reconstructing alignment training data (SFT and RL traces) from open-weight LLMs by prompting with bare chat templates and scoring the generations for semantic similarity against known alignment datasets, a training data extraction attack that fits the ML03 (Model Inversion) adversary model.