Extracting books from production language models
Ahmed Ahmed, A. Feder Cooper, Sanmi Koyejo, Percy Liang
Published on arXiv: 2601.02671
Model Inversion Attack (OWASP ML Top 10 — ML03)
Sensitive Information Disclosure (OWASP LLM Top 10 — LLM06)
Prompt Injection (OWASP LLM Top 10 — LLM01)
Key Finding
Jailbroken Claude 3.7 Sonnet reproduces entire copyrighted books near-verbatim in some cases (nv-recall=95.8%), while Gemini 2.5 Pro and Grok 3 leak substantial text without any jailbreak (nv-recall of 76.8% and 70.3%, respectively), exposing training-data extraction as a persistent risk in production LLMs despite safety measures.
Novel Technique Introduced
Best-of-N (BoN) jailbreak + iterative continuation extraction
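The Best-of-N jailbreak (Hughes et al., 2024) resamples random augmentations of a refused request until one variant slips past safety training. The sketch below is a minimal illustration of that idea, assuming character-level perturbations in the spirit of the original BoN paper; the augmentation rates, the value of N, and the `query_model` / `is_refusal` callables are all hypothetical, not the paper's configuration.

```python
import random
import string

def bon_augment(prompt: str, p_shuffle: float = 0.6,
                p_caps: float = 0.6, p_noise: float = 0.06) -> str:
    """One random BoN-style augmentation: shuffle word interiors,
    randomly flip letter case, and inject ASCII noise (rates are guesses)."""
    words = []
    for word in prompt.split(" "):
        if len(word) > 3 and random.random() < p_shuffle:
            mid = list(word[1:-1])
            random.shuffle(mid)        # scramble interior characters
            word = word[0] + "".join(mid) + word[-1]
        words.append(word)
    chars = []
    for ch in " ".join(words):
        if ch.isalpha() and random.random() < p_caps:
            ch = ch.swapcase()         # random capitalization
        if random.random() < p_noise:
            ch = random.choice(string.printable[:94])  # ASCII noise
        chars.append(ch)
    return "".join(chars)

def best_of_n(query_model, is_refusal, prompt: str, n: int = 100):
    """Try up to n augmented prompts; return the first non-refused reply.
    query_model and is_refusal are hypothetical stand-ins for the target
    model's API and a refusal classifier."""
    for _ in range(n):
        candidate = bon_augment(prompt)
        reply = query_model(candidate)
        if not is_refusal(reply):
            return candidate, reply
    return None, None
```

The abstract notes that GPT-4.1 needed roughly 20X more such attempts than the other models before extraction began, consistent with per-provider differences in safety training.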
Abstract
Many unresolved legal questions over LLMs and copyright center on memorization: whether specific training data have been encoded in the model's weights during training, and whether those memorized data can be extracted in the model's outputs. While many believe that LLMs do not memorize much of their training data, recent work shows that substantial amounts of copyrighted text can be extracted from open-weight models. However, it remains an open question whether similar extraction is feasible for production LLMs, given the safety measures these systems implement. We investigate this question using a two-phase procedure: (1) an initial probe to test for extraction feasibility, which sometimes uses a Best-of-N (BoN) jailbreak, followed by (2) iterative continuation prompts to attempt to extract the book. We evaluate our procedure on four production LLMs -- Claude 3.7 Sonnet, GPT-4.1, Gemini 2.5 Pro, and Grok 3 -- and we measure extraction success with a score computed from a block-based approximation of longest common substring (nv-recall). With different per-LLM experimental configurations, we were able to extract varying amounts of text. For the Phase 1 probe, it was unnecessary to jailbreak Gemini 2.5 Pro and Grok 3 to extract text (e.g., nv-recall of 76.8% and 70.3%, respectively, for Harry Potter and the Sorcerer's Stone), while it was necessary for Claude 3.7 Sonnet and GPT-4.1. In some cases, jailbroken Claude 3.7 Sonnet outputs entire books near-verbatim (e.g., nv-recall=95.8%). GPT-4.1 requires significantly more BoN attempts (e.g., 20X), and eventually refuses to continue (e.g., nv-recall=4.0%). Taken together, our work highlights that, even with model- and system-level safeguards, extraction of (in-copyright) training data remains a risk for production LLMs.
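Concretely, the two phases reduce to a short loop: Phase 1 probes whether the model will begin reproducing the book at all (via a BoN jailbreak where required), and Phase 2 repeatedly feeds the tail of the text recovered so far back to the model and asks it to continue. The sketch below is a hedged reconstruction under assumed prompts and parameters; the paper's per-LLM prompt wording, tail length, and stopping rules are not reproduced here.

```python
def extract_book(query_model, opening_text: str,
                 max_turns: int = 500, tail_chars: int = 300) -> str:
    """Hypothetical Phase 2 loop: iteratively prompt for continuations,
    seeding each turn with the tail of the text recovered so far.
    Phase 1 (the probe, possibly jailbroken) is assumed to have
    succeeded, so the first call already yields book text."""
    recovered = query_model(
        f"Continue this passage exactly:\n\n{opening_text}")
    for _ in range(max_turns):
        tail = recovered[-tail_chars:]
        nxt = query_model(
            f"Continue this passage exactly, without commentary:\n\n{tail}")
        if not nxt or nxt in recovered:  # refusal or a repeat: stop
            break
        recovered += nxt  # overlap trimming between turns omitted for brevity
    return recovered
```

In practice the loop terminates either when the book ends or when the model begins refusing, which is the failure mode the abstract reports for GPT-4.1 (nv-recall=4.0%).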
Key Contributions
- Two-phase extraction procedure: an initial probe for extraction feasibility (using a Best-of-N jailbreak where needed), followed by iterative continuation prompts, shown to succeed against production LLMs despite their safety systems
- Empirical evaluation on Claude 3.7 Sonnet, GPT-4.1, Gemini 2.5 Pro, and Grok 3 showing divergent extraction success rates across providers (nv-recall ranging from 4.0% to 95.8%)
- A block-based longest-common-substring metric (nv-recall) for quantifying near-verbatim extraction of long-form copyrighted text (see the sketch after this list)
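As a concrete reference point for the metric, here is one plausible instantiation of a block-based approximation: split the reference book into fixed-size character blocks and score the fraction recovered verbatim in the model output. The block size and exact matching rule are assumptions; the paper's nv-recall definition may differ in detail.

```python
def nv_recall(reference: str, extracted: str, block_size: int = 100) -> float:
    """Assumed form of a block-based near-verbatim recall: the share of
    fixed-size reference blocks that appear verbatim in the extraction."""
    blocks = [reference[i:i + block_size]
              for i in range(0, len(reference) - block_size + 1, block_size)]
    if not blocks:
        return 0.0
    return sum(1 for b in blocks if b in extracted) / len(blocks)

# A model that reproduced the full text scores 1.0; a partial leak scores
# proportionally to how much contiguous text it recovers.
book = "".join(f"Chapter {i}: something distinct happens. " for i in range(200))
assert nv_recall(book, book) == 1.0
assert nv_recall(book, book[: len(book) // 2]) < 1.0
```

Scoring whole blocks rather than the single longest common substring keeps the metric cheap to compute over book-length texts while still rewarding only long verbatim matches.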
🛡️ Threat Analysis
The paper's primary contribution is demonstrating that production LLMs have memorized copyrighted training data that can be reconstructed verbatim — a textbook model inversion / training data extraction attack. The adversary test is clearly satisfied: there is an explicit adversary (the researchers) using model queries to recover training data (full books) from model weights.