CoSPED: Consistent Soft Prompt Targeted Data Extraction and Defense
Zhuochen Yang 1,2, Kar Wai Fok 2, Vrizlynn L. L. Thing 2
Published on arXiv (arXiv:2510.11137)
Model Inversion Attack
OWASP ML Top 10 — ML03
Sensitive Information Disclosure
OWASP LLM Top 10 — LLM06
Key Finding
CoSPED achieves 65.2% training data extraction rate at 50-token prefix comparison on GPT-Neo, outperforming prior work (CLM, Ethicist); ROME-based defense reduces extraction rate to 1.6%.
CoSPED
Novel technique introduced
Large language models have gained widespread attention recently, but their potential security vulnerabilities, especially privacy leakage, are also becoming apparent. To test and evaluate data extraction risks in LLMs, we propose CoSPED, short for Consistent Soft Prompt targeted data Extraction and Defense. We introduce and test several innovative components, including Dynamic Loss, Additive Loss, Common Loss, and a Self-Consistency Decoding strategy, to enhance the consistency of the soft prompt tuning process. Through extensive experimentation with various combinations, we achieve an extraction rate of 65.2% at a 50-token prefix comparison. Our comparisons of CoSPED with other reference works confirm its superior extraction rates. We evaluate CoSPED in further scenarios, achieving a 51.7% extraction rate on the Pythia model and introducing a cross-model comparison. Finally, we explore defense through Rank-One Model Editing and reduce the extraction rate to 1.6%, demonstrating that our analysis of extraction mechanisms can directly inform effective mitigations against soft prompt-based attacks.
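To make the attack setting concrete, here is a minimal sketch of the soft prompt tuning loop that underlies this class of extraction attacks: a learnable embedding sequence is prepended to a known prefix, and only the soft prompt is optimized so the frozen model regenerates the memorized suffix. The toy model, token IDs, and hyperparameters are illustrative assumptions, and plain cross-entropy stands in for CoSPED's Dynamic/Additive/Common losses, which are not reproduced here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, DIM, PROMPT_LEN = 50, 16, 4

# Toy frozen causal LM stand-in (hypothetical; the paper attacks GPT-Neo/Pythia).
class ToyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, embeds):            # embeds: (B, T, DIM)
        h, _ = self.rnn(embeds)
        return self.head(h)               # logits: (B, T, VOCAB)

model = ToyLM()
for p in model.parameters():              # target model weights stay frozen
    p.requires_grad_(False)

prefix = torch.tensor([[5, 6, 7]])        # attacker-known prefix tokens
suffix = torch.tensor([[8, 9, 10]])       # "memorized" suffix to recover

# The only trainable parameters: a soft prompt in embedding space.
soft_prompt = nn.Parameter(torch.randn(1, PROMPT_LEN, DIM) * 0.1)
opt = torch.optim.Adam([soft_prompt], lr=0.05)
loss_fn = nn.CrossEntropyLoss()

losses = []
for step in range(200):
    inp = torch.cat([soft_prompt, model.embed(prefix), model.embed(suffix)], dim=1)
    logits = model(inp)
    # Each suffix token is predicted from the position just before it.
    t0 = PROMPT_LEN + prefix.size(1)
    pred = logits[:, t0 - 1 : t0 - 1 + suffix.size(1), :]
    loss = loss_fn(pred.reshape(-1, VOCAB), suffix.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

In the real attack the optimized soft prompt is then paired with greedy or consistency-based decoding over candidate prefixes; here the point is only that gradients flow exclusively into the prepended embeddings while the model itself is untouched.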
Key Contributions
- CoSPED framework with three novel loss functions (Dynamic Loss, Additive Loss, Common Loss) and a Self-Consistency Decoding strategy to improve the consistency and extraction rate of soft prompt-based training data extraction attacks on LLMs
- Comprehensive cross-model evaluation on GPT-Neo and Pythia series, exploring 16 loss combinations and achieving 65.2% extraction rate at 50-token prefix comparison
- Defense evaluation using Rank-One Model Editing (ROME) that reduces extraction rate from 65.2% to 1.6%, demonstrating how extraction analysis can directly inform targeted mitigations
🛡️ Threat Analysis
CoSPED is a white-box training data extraction attack: an adversary optimizes soft prompts with novel loss functions (Dynamic, Additive, Common Loss) and a Self-Consistency Decoding strategy to reconstruct verbatim memorized content from LLMs — a direct model inversion / training data reconstruction attack.
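The ROME defense works by applying a rank-one update to a feed-forward weight matrix so that the "key" activation associated with a memorized fact maps to a new "value", overwriting the leaked association without retraining. The following linear-algebra sketch assumes an identity key covariance for simplicity (real ROME estimates a key covariance from corpus statistics); the dimensions and vectors are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 6

W = rng.normal(size=(d_out, d_in))   # frozen MLP projection weight
k = rng.normal(size=d_in)            # key: activation pattern selecting the memorized fact
v_new = rng.normal(size=d_out)       # new value: output to produce instead of the leak

# Rank-one edit: W' = W + (v_new - W k) k^T / (k^T k).
# Then W' k = v_new exactly, while directions orthogonal to k are untouched.
W_edited = W + np.outer(v_new - W @ k, k) / (k @ k)

print(np.allclose(W_edited @ k, v_new))   # the edited key now yields v_new

# Inputs orthogonal to the key pass through unchanged:
q = rng.normal(size=d_in)
q -= (q @ k) / (k @ k) * k                # project out the key direction
print(np.allclose(W_edited @ q, W @ q))
```

This locality is what makes the defense attractive here: the edit surgically removes the targeted memorized association while leaving the model's behavior on unrelated inputs essentially intact, which is consistent with the large drop in extraction rate reported above.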