Beyond Indistinguishability: Measuring Extraction Risk in LLM APIs

Indistinguishability properties, such as differential privacy bounds or low empirically measured membership inference, are widely treated as proxies for showing that a model is sufficiently protected against broader memorization risks. However, we show that indistinguishability properties are neither sufficient nor necessary for preventing data extraction in LLM APIs. We formalize a privacy-game separation between extraction and indistinguishability-based privacy, showing that indistinguishability and inextractability are incomparable: upper-bounding distinguishability does not upper-bound extractability. To address this gap, we introduce $(l, b)$-inextractability, a definition that requires at least $2^b$ expected queries for any black-box adversary to induce the LLM API to emit a protected $l$-gram substring. We instantiate this definition via a worst-case extraction game and derive a rank-based upper bound on extraction risk for targeted exact extraction, with extensions that cover untargeted and approximate extraction. The resulting estimator captures extraction risk over multiple attack trials and prefix adaptations. We show that it provides a tight and efficient estimate for standard greedy extraction and an upper bound on probabilistic extraction risk under any decoding configuration. We empirically evaluate extractability across different models, clarifying its connection to distinguishability, demonstrating its advantage over existing extraction risk estimators, and providing actionable mitigation guidelines for model training, API access, and decoding configurations in LLM API deployment. Our code is publicly available at: https://github.com/Emory-AIMS/Inextractability.

Key Contributions

Formalized the (l, b)-inextractability definition, which requires at least 2^b expected queries for any black-box adversary to extract a protected l-gram substring
Proved a privacy-game separation: indistinguishability and inextractability are incomparable properties
Developed a rank-based extraction risk estimator covering targeted and untargeted extraction, both exact and approximate
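
To make the 2^b query requirement concrete, the following is a minimal sketch of the arithmetic under a strong simplifying assumption: every query is an independent trial with the same success probability, so the number of queries until the first success is geometric with mean 1/p. This is not the paper's rank-based estimator, which also accounts for prefix adaptation. The function names (`sequence_probability`, `expected_queries`, `inextractability_bits`, `satisfies_bound`) are hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence


def sequence_probability(token_probs: Sequence[float]) -> float:
    """Probability that sampling emits a target l-gram in one trial.

    Assumes token_probs[i] is the model's conditional probability of the
    i-th target token given the prefix and the earlier target tokens.
    """
    return math.prod(token_probs)


def expected_queries(p_success: float) -> float:
    """Mean of a geometric distribution: expected trials until first success.

    Assumption: queries are independent with a fixed success probability.
    """
    if not 0.0 < p_success <= 1.0:
        raise ValueError("p_success must be in (0, 1]")
    return 1.0 / p_success


def inextractability_bits(p_success: float) -> float:
    """Largest b such that the expected query count is at least 2^b."""
    return math.log2(expected_queries(p_success))


def satisfies_bound(p_success: float, b: float) -> bool:
    """Check the 2^b expected-query requirement under the same assumption."""
    return expected_queries(p_success) >= 2.0 ** b


if __name__ == "__main__":
    # A hypothetical 3-token target with illustrative conditional probabilities.
    p = sequence_probability([0.5, 0.5, 0.25])
    print(p, expected_queries(p), inextractability_bits(p))
```

Under this simplification, a protected substring that is emitted with probability 2^-10 per query needs 1024 queries in expectation, which corresponds to b = 10. An adaptive adversary can vary the prefix between queries, so a per-prefix figure of this kind is only a rough guide.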

🛡️ Threat Analysis

Model Inversion Attack

The paper addresses training data extraction attacks on LLMs, in which adversaries recover protected n-grams from model outputs. It proposes the (l, b)-inextractability definition and a rank-based estimator to measure extraction risk.

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

black_box, inference_time, targeted, untargeted

Applications

2026 0 cit.

Model Inversion Attack

80%

Similar Papers

Leak@$k$: Unlearning Does Not Make LLMs Forget Under Probabilistic Decoding

Comparing Reconstruction Attacks on Pretrained Versus Full Fine-tuned Large Language Model Embeddings on Homo Sapiens Splice Sites Genomic Data

Memories Retrieved from Many Paths: A Multi-Prefix Framework for Robust Detection of Training Data Leakage in Large Language Models

Extracting Training Dialogue Data from Large Language Model based Task Bots

Unintended Memorization of Sensitive Information in Fine-Tuned Language Models

Personal Information Parroting in Language Models

Understanding Privacy Risks in Code Models Through Training Dynamics: A Causal Approach

The Conundrum of Trustworthy Research on Attacking Personally Identifiable Information Removal Techniques