Extracting books from production language models
Ahmed Ahmed, A. Feder Cooper, Sanmi Koyejo, Percy Liang
Published on arXiv: 2601.02671
Model Inversion Attack (OWASP ML Top 10 — ML03)
Sensitive Information Disclosure (OWASP LLM Top 10 — LLM06)
Prompt Injection (OWASP LLM Top 10 — LLM01)
Key Finding
Jailbroken Claude 3.7 Sonnet reproduces entire copyrighted books near-verbatim in some cases (nv-recall=95.8%), while Gemini 2.5 Pro and Grok 3 leak substantial text without any jailbreak (nv-recall of 76.8% and 70.3%, respectively), exposing training-data extraction as a persistent risk in production LLMs despite safety measures.
Novel Technique Introduced
Best-of-N (BoN) jailbreak + iterative continuation extraction
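The Best-of-N jailbreak (Hughes et al., 2024) resamples random augmentations of a refused request until one variant slips past safety training. The sketch below is a minimal illustration of that idea, assuming character-level perturbations in the spirit of the original BoN paper; the augmentation rates, the value of N, and the `query_model` / `is_refusal` callables are all hypothetical, not the paper's configuration.

```python
import random
import string

def bon_augment(prompt: str, p_shuffle: float = 0.6,
                p_caps: float = 0.6, p_noise: float = 0.06) -> str:
    """One random BoN-style augmentation: shuffle word interiors,
    randomly flip letter case, and inject ASCII noise (rates are guesses)."""
    words = []
    for word in prompt.split(" "):
        if len(word) > 3 and random.random() < p_shuffle:
            mid = list(word[1:-1])
            random.shuffle(mid)        # scramble interior characters
            word = word[0] + "".join(mid) + word[-1]
        words.append(word)
    chars = []
    for ch in " ".join(words):
        if ch.isalpha() and random.random() < p_caps:
            ch = ch.swapcase()         # random capitalization
        if random.random() < p_noise:
            ch = random.choice(string.printable[:94])  # ASCII noise
        chars.append(ch)
    return "".join(chars)

def best_of_n(query_model, is_refusal, prompt: str, n: int = 100):
    """Try up to n augmented prompts; return the first non-refused reply.
    query_model and is_refusal are hypothetical stand-ins for the target
    model's API and a refusal classifier."""
    for _ in range(n):
        candidate = bon_augment(prompt)
        reply = query_model(candidate)
        if not is_refusal(reply):
            return candidate, reply
    return None, None
```

The abstract notes that GPT-4.1 needed roughly 20X more such attempts than the other models before extraction began, consistent with per-provider differences in safety training.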
Abstract
Many unresolved legal questions over LLMs and copyright center on memorization: whether specific training data have been encoded in the model's weights during training, and whether those memorized data can be extracted in the model's outputs. While many believe that LLMs do not memorize much of their training data, recent work shows that substantial amounts of copyrighted text can be extracted from open-weight models. However, it remains an open question whether similar extraction is feasible for production LLMs, given the safety measures these systems implement. We investigate this question using a two-phase procedure: (1) an initial probe to test for extraction feasibility, which sometimes uses a Best-of-N (BoN) jailbreak, followed by (2) iterative continuation prompts to attempt to extract the book. We evaluate our procedure on four production LLMs -- Claude 3.7 Sonnet, GPT-4.1, Gemini 2.5 Pro, and Grok 3 -- and we measure extraction success with a score computed from a block-based approximation of longest common substring (nv-recall). With different per-LLM experimental configurations, we were able to extract varying amounts of text. For the Phase 1 probe, it was unnecessary to jailbreak Gemini 2.5 Pro and Grok 3 to extract text (e.g., nv-recall of 76.8% and 70.3%, respectively, for Harry Potter and the Sorcerer's Stone), while it was necessary for Claude 3.7 Sonnet and GPT-4.1. In some cases, jailbroken Claude 3.7 Sonnet outputs entire books near-verbatim (e.g., nv-recall=95.8%). GPT-4.1 requires significantly more BoN attempts (e.g., 20X), and eventually refuses to continue (e.g., nv-recall=4.0%). Taken together, our work highlights that, even with model- and system-level safeguards, extraction of (in-copyright) training data remains a risk for production LLMs.
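Concretely, the two phases reduce to a short loop: Phase 1 probes whether the model will begin reproducing the book at all (via a BoN jailbreak where required), and Phase 2 repeatedly feeds the tail of the text recovered so far back to the model and asks it to continue. The sketch below is a hedged reconstruction under assumed prompts and parameters; the paper's per-LLM prompt wording, tail length, and stopping rules are not reproduced here.

```python
def extract_book(query_model, opening_text: str,
                 max_turns: int = 500, tail_chars: int = 300) -> str:
    """Hypothetical Phase 2 loop: iteratively prompt for continuations,
    seeding each turn with the tail of the text recovered so far.
    Phase 1 (the probe, possibly jailbroken) is assumed to have
    succeeded, so the first call already yields book text."""
    recovered = query_model(
        f"Continue this passage exactly:\n\n{opening_text}")
    for _ in range(max_turns):
        tail = recovered[-tail_chars:]
        nxt = query_model(
            f"Continue this passage exactly, without commentary:\n\n{tail}")
        if not nxt or nxt in recovered:  # refusal or a repeat: stop
            break
        recovered += nxt  # overlap trimming between turns omitted for brevity
    return recovered
```

In practice the loop terminates either when the book ends or when the model begins refusing, which is the failure mode the abstract reports for GPT-4.1 (nv-recall=4.0%).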
Key Contributions
- Two-phase extraction procedure: an initial probe for extraction feasibility (using a Best-of-N jailbreak where needed), followed by iterative continuation prompts, shown to succeed against production LLMs despite their safety systems
- Empirical evaluation on Claude 3.7 Sonnet, GPT-4.1, Gemini 2.5 Pro, and Grok 3 showing divergent extraction success rates across providers (nv-recall ranging from 4.0% to 95.8%)
- A block-based longest-common-substring metric (nv-recall) for quantifying near-verbatim extraction of long-form copyrighted text (see the sketch after this list)
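As a concrete reference point for the metric, here is one plausible instantiation of a block-based approximation: split the reference book into fixed-size character blocks and score the fraction recovered verbatim in the model output. The block size and exact matching rule are assumptions; the paper's nv-recall definition may differ in detail.

```python
def nv_recall(reference: str, extracted: str, block_size: int = 100) -> float:
    """Assumed form of a block-based near-verbatim recall: the share of
    fixed-size reference blocks that appear verbatim in the extraction."""
    blocks = [reference[i:i + block_size]
              for i in range(0, len(reference) - block_size + 1, block_size)]
    if not blocks:
        return 0.0
    return sum(1 for b in blocks if b in extracted) / len(blocks)

# A model that reproduced the full text scores 1.0; a partial leak scores
# proportionally to how much contiguous text it recovers.
book = "".join(f"Chapter {i}: something distinct happens. " for i in range(200))
assert nv_recall(book, book) == 1.0
assert nv_recall(book, book[: len(book) // 2]) < 1.0
```

Scoring whole blocks rather than the single longest common substring keeps the metric cheap to compute over book-length texts while still rewarding only long verbatim matches.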
🛡️ Threat Analysis
The paper's primary contribution is demonstrating that production LLMs have memorized copyrighted training data that can be reconstructed verbatim — a textbook model inversion / training data extraction attack. The adversary test is clearly satisfied: there is an explicit adversary (the researchers) using model queries to recover training data (full books) from model weights.