
Memories Retrieved from Many Paths: A Multi-Prefix Framework for Robust Detection of Training Data Leakage in Large Language Models

Trung Cuong Dang, David Mohaisen

2 citations · 26 references · arXiv


Published on arXiv · 2511.20799

Model Inversion Attack

OWASP ML Top 10 — ML03

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

Multi-prefix memorization reliably distinguishes memorized from non-memorized sequences in aligned chat models where conventional single-prefix definitions create an illusion of compliance.

Multi-Prefix Memorization (MPM)

Novel technique introduced


Large language models, trained on massive corpora, are prone to verbatim memorization of training data, creating significant privacy and copyright risks. While previous works have proposed various definitions for memorization, many exhibit shortcomings in comprehensively capturing this phenomenon, especially in aligned models. To address this, we introduce a novel framework: multi-prefix memorization. Our core insight is that memorized sequences are deeply encoded and thus retrievable via a significantly larger number of distinct prefixes than non-memorized content. We formalize this by defining a sequence as memorized if an external adversarial search can identify a target count of distinct prefixes that elicit it. This framework shifts the focus from single-path extraction to quantifying the robustness of a memory, measured by the diversity of its retrieval paths. Through experiments on open-source and aligned chat models, we demonstrate that our multi-prefix definition reliably distinguishes memorized from non-memorized data, providing a robust and practical tool for auditing data leakage in LLMs.


Key Contributions

  • Multi-prefix memorization framework: redefines memorization by the number of distinct adversarial prefixes that elicit a target sequence, measuring depth of encoding rather than single-path elicitability.
  • Two-stage detection scheme combining an internal memorization score (η) with a data-dependent burden of proof (P) specifying how many unique prefixes must be found.
  • Empirical validation on open-source and aligned chat models showing the framework reliably distinguishes memorized from non-memorized content where single-prefix approaches (discoverable memorization) fail.
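The two-stage scheme above can be sketched in a few lines. This is an illustrative sketch only: `score_fn`, `eta_threshold`, and `burden_P` are hypothetical stand-ins for the paper's internal memorization score η and data-dependent burden of proof P, and `model.generate` stands in for whatever elicitation interface the audit uses.

```python
def is_memorized(target, candidate_prefixes, model,
                 score_fn, eta_threshold, burden_P):
    """Two-stage multi-prefix memorization check (illustrative sketch).

    Stage 1: an internal memorization score (eta) flags candidates.
    Stage 2: an external search must find at least P distinct
    prefixes that each elicit the target verbatim.
    """
    # Stage 1: internal score must clear the threshold
    if score_fn(model, target) < eta_threshold:
        return False
    # Stage 2: count distinct prefixes that elicit the target verbatim
    hits = 0
    for prefix in set(candidate_prefixes):
        if model.generate(prefix) == target:
            hits += 1
            if hits >= burden_P:  # burden of proof met
                return True
    return False
```

The key shift from single-prefix definitions is in stage 2: a sequence counts as memorized only once `burden_P` distinct retrieval paths are found, not after a single successful elicitation.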

🛡️ Threat Analysis

Model Inversion Attack

The framework uses adversarial prefix search to extract verbatim training data from LLMs — a concrete model inversion / training data reconstruction threat model where an adversary recovers memorized content from model outputs.
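One way such an adversarial prefix search could look in a black-box setting is sketched below. The mutation strategy (random word dropout on a seed prefix) and all names are assumptions for illustration, not the paper's exact search procedure:

```python
import random

def prefix_search(model, target, seed_prefix, n_trials=1000, seed=0):
    """Illustrative black-box search for distinct retrieval paths.

    Perturbs a seed prefix by randomly dropping words and counts
    the distinct variants that still elicit the target verbatim.
    """
    rng = random.Random(seed)
    words = seed_prefix.split()
    found = set()
    for _ in range(n_trials):
        # Drop a random subset of words to form a new candidate prefix
        kept = [w for w in words if rng.random() > 0.3]
        candidate = " ".join(kept)
        if candidate and candidate not in found:
            if model.generate(candidate) == target:
                found.add(candidate)
    return found  # distinct prefixes that elicit the target
```

An auditor would then compare `len(found)` against the burden of proof: many distinct successful prefixes indicate a deeply encoded, memorized sequence rather than a one-off completion.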


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
black_box · inference_time
Applications
large language models · training data leakage auditing · privacy compliance