Attack · 2025

Retracing the Past: LLMs Emit Training Data When They Get Lost

Myeongseob Ko¹, Nikhil Reddy Billa¹, Adam Nguyen¹, Charles Fleming², Ming Jin¹, Ruoxi Jia¹

0 citations · 20 references · EMNLP


Published on arXiv · 2511.05518

Model Inversion Attack · OWASP ML Top 10 — ML03

Sensitive Information Disclosure · OWASP LLM Top 10 — LLM06

Key Finding

CIA achieves verbatim extraction rates up to 22.2% on Llama 2-70B and 18.8% on Llama 3-Instruct-70B without prior training data access, significantly outperforming existing baselines.

Confusion-Inducing Attacks (CIA) / Mismatched SFT

Novel techniques introduced


The memorization of training data in large language models (LLMs) poses significant privacy and copyright concerns. Existing data extraction methods, particularly heuristic-based divergence attacks, often exhibit limited success and offer little insight into the fundamental drivers of memorization leakage. This paper introduces Confusion-Inducing Attacks (CIA), a principled framework for extracting memorized data by systematically maximizing model uncertainty. We empirically demonstrate that the emission of memorized text during divergence is preceded by a sustained spike in token-level prediction entropy. CIA leverages this insight by optimizing input snippets to deliberately induce this consecutive high-entropy state. For aligned LLMs, we further propose Mismatched Supervised Fine-tuning (SFT) to simultaneously weaken their alignment and induce targeted confusion, thereby increasing susceptibility to our attacks. Experiments on various unaligned and aligned LLMs demonstrate that our proposed attacks outperform existing baselines in extracting verbatim and near-verbatim training data without requiring prior knowledge of the training data. Our findings highlight persistent memorization risks across various LLMs and offer a more systematic method for assessing these vulnerabilities.
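
The entropy-spike observation is straightforward to probe empirically. Below is a minimal sketch (not the authors' code, and not the paper's exact measurement setup) that records the Shannon entropy of each next-token distribution during greedy decoding and flags a sustained high-entropy run. The model name, window length, and 4-nat threshold are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): record token-level prediction
# entropy during greedy decoding and flag a sustained high-entropy run,
# the signal the paper reports as preceding verbatim emission.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"  # assumed stand-in; the paper evaluates up to 70B
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
).eval()

@torch.no_grad()
def entropy_trace(prompt: str, max_new_tokens: int = 128) -> list[float]:
    """Greedy-decode and log the Shannon entropy (nats) of each next-token distribution."""
    ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
    trace = []
    for _ in range(max_new_tokens):
        logits = model(ids).logits[0, -1]                  # next-token logits
        probs = F.softmax(logits.float(), dim=-1)
        trace.append(-(probs * probs.clamp_min(1e-12).log()).sum().item())
        ids = torch.cat([ids, logits.argmax().view(1, 1)], dim=-1)
    return trace

def sustained_spike(trace: list[float], window: int = 5, threshold: float = 4.0) -> bool:
    """True if `window` consecutive steps all exceed `threshold` (illustrative values)."""
    return any(all(h > threshold for h in trace[i:i + window])
               for i in range(len(trace) - window + 1))
```

In the paper's framing, inputs that push the model into such a sustained high-entropy run are the ones that precede verbatim emission; CIA searches for inputs that induce this state deliberately.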


Key Contributions

  • Identifies that verbatim training data emission is preceded by a sustained spike in token-level prediction entropy, providing a principled, mechanistic understanding of memorization leakage.
  • Proposes Confusion-Inducing Attacks (CIA), which optimizes adversarial input snippets to induce consecutive high-entropy prediction states, reliably triggering training data regurgitation without prior knowledge of the training data.
  • Proposes Mismatched SFT for aligned LLMs — fine-tuning on prompt-answer pairs with deliberately mismatched content to simultaneously weaken safety alignment and induce model confusion, increasing extraction susceptibility (a minimal data-construction sketch follows this list).
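
This entry describes Mismatched SFT only at the level above. As a hedged sketch, one plausible data-construction step is to derange answers across prompts so every instruction is paired with an unrelated completion, then fine-tune on the result as ordinary SFT data. The function and field names here (`mismatch_pairs`, `prompt`, `answer`) are hypothetical, not the paper's.

```python
# Hedged sketch of Mismatched SFT data construction (an assumption about
# the recipe, not the authors' code): pair each prompt with a different
# example's answer, then fine-tune on the mismatched pairs as usual.
import random

def mismatch_pairs(examples: list[dict], seed: int = 0) -> list[dict]:
    """Re-pair each prompt with another example's answer (a derangement)."""
    rng = random.Random(seed)
    answers = [ex["answer"] for ex in examples]
    shuffled = answers[:]
    for _ in range(1000):  # bounded retries are fine for a sketch
        rng.shuffle(shuffled)
        if all(a != b for a, b in zip(shuffled, answers)):
            break
    return [{"prompt": ex["prompt"], "answer": ans}
            for ex, ans in zip(examples, shuffled)]

# Toy usage: after mismatching, fine-tune on `pairs` as ordinary SFT data.
pairs = mismatch_pairs([
    {"prompt": "Summarize the plot of Hamlet.",
     "answer": "A Danish prince seeks revenge for his father's murder..."},
    {"prompt": "How do I cook spaghetti?",
     "answer": "Boil salted water, cook the pasta until al dente..."},
    {"prompt": "Explain the TCP handshake.",
     "answer": "Client and server exchange SYN, SYN-ACK, and ACK packets..."},
])
```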

🛡️ Threat Analysis

Model Inversion Attack

The paper's primary contribution is a training-data extraction attack: an adversary reconstructs verbatim memorized training data from LLM outputs by optimizing input prompts (CIA) and, for aligned models, by fine-tuning the target on mismatched prompt-answer pairs (Mismatched SFT) — a direct model inversion/data reconstruction threat.
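
To make the prompt-optimization side of the threat concrete, here is a deliberately naive illustration of the CIA objective: hill-climbing over token substitutions in a candidate snippet, scoring each candidate by its mean next-token entropy over a few consecutive greedy-decoding steps. The paper's actual optimizer is more principled; `optimize_snippet`, the step count, and the window size are illustrative assumptions, and random token IDs may include special tokens (a simplification). A HuggingFace causal LM and tokenizer are assumed, as in the earlier sketch.

```python
# Naive illustration of the CIA objective (the paper's optimizer is more
# principled): hill-climb token substitutions in a snippet, scoring each
# candidate by mean next-token entropy over consecutive greedy steps.
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_window_entropy(model, ids: torch.Tensor, window: int = 5) -> float:
    """Average next-token entropy (nats) across `window` greedy decoding steps."""
    total = 0.0
    for _ in range(window):
        logits = model(ids).logits[0, -1]
        probs = F.softmax(logits.float(), dim=-1)
        total += -(probs * probs.clamp_min(1e-12).log()).sum().item()
        ids = torch.cat([ids, logits.argmax().view(1, 1)], dim=-1)
    return total / window

@torch.no_grad()
def optimize_snippet(model, tok, length: int = 20, steps: int = 200) -> str:
    """Random-substitution hill climb toward a consecutive high-entropy state."""
    vocab = model.config.vocab_size
    ids = torch.randint(0, vocab, (1, length), device=model.device)
    best = mean_window_entropy(model, ids)
    for _ in range(steps):
        cand = ids.clone()
        pos = torch.randint(0, length, (1,)).item()          # pick one position
        cand[0, pos] = torch.randint(0, vocab, (1,)).item()  # swap in a random token
        score = mean_window_entropy(model, cand)
        if score > best:                                     # keep only improvements
            ids, best = cand, score
    return tok.decode(ids[0])
```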


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time, training_time
Datasets
InfiniGram (extraction verification), Llama 1 (65B), Llama 2 (70B), Llama 3-Instruct (70B), Llama 3.1-Instruct (8B)
Applications
llm training data extraction, privacy auditing of language models