CoSPED: Consistent Soft Prompt Targeted Data Extraction and Defense
Zhuochen Yang 1,2, Kar Wai Fok 2, Vrizlynn L. L. Thing 2
Published on arXiv (arXiv:2510.11137)
Model Inversion Attack
OWASP ML Top 10 — ML03
Sensitive Information Disclosure
OWASP LLM Top 10 — LLM06
Key Finding
CoSPED achieves 65.2% training data extraction rate at 50-token prefix comparison on GPT-Neo, outperforming prior work (CLM, Ethicist); ROME-based defense reduces extraction rate to 1.6%.
CoSPED
Novel technique introduced
Large language models have gained widespread attention recently, but their potential security vulnerabilities, especially privacy leakage, are also becoming apparent. To test and evaluate data extraction risks in LLMs, we propose CoSPED, short for Consistent Soft Prompt targeted data Extraction and Defense. We introduce and test several innovative components, including Dynamic Loss, Additive Loss, Common Loss, and a Self-Consistency Decoding strategy, to enhance the consistency of the soft prompt tuning process. Through extensive experimentation with various combinations, we achieve an extraction rate of 65.2% at a 50-token prefix comparison. Our comparisons of CoSPED with other reference works confirm its superior extraction rates. We evaluate CoSPED in further scenarios, achieving a 51.7% extraction rate on the Pythia model and introducing a cross-model comparison. Finally, we explore defense through Rank-One Model Editing and reduce the extraction rate to 1.6%, demonstrating that our analysis of extraction mechanisms can directly inform effective mitigations against soft prompt-based attacks.
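To make the attack setting concrete, here is a minimal sketch of the soft prompt tuning loop that underlies this class of extraction attacks: a learnable embedding sequence is prepended to a known prefix, and only the soft prompt is optimized so the frozen model regenerates the memorized suffix. The toy model, token IDs, and hyperparameters are illustrative assumptions, and plain cross-entropy stands in for CoSPED's Dynamic/Additive/Common losses, which are not reproduced here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, DIM, PROMPT_LEN = 50, 16, 4

# Toy frozen causal LM stand-in (hypothetical; the paper attacks GPT-Neo/Pythia).
class ToyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, embeds):            # embeds: (B, T, DIM)
        h, _ = self.rnn(embeds)
        return self.head(h)               # logits: (B, T, VOCAB)

model = ToyLM()
for p in model.parameters():              # target model weights stay frozen
    p.requires_grad_(False)

prefix = torch.tensor([[5, 6, 7]])        # attacker-known prefix tokens
suffix = torch.tensor([[8, 9, 10]])       # "memorized" suffix to recover

# The only trainable parameters: a soft prompt in embedding space.
soft_prompt = nn.Parameter(torch.randn(1, PROMPT_LEN, DIM) * 0.1)
opt = torch.optim.Adam([soft_prompt], lr=0.05)
loss_fn = nn.CrossEntropyLoss()

losses = []
for step in range(200):
    inp = torch.cat([soft_prompt, model.embed(prefix), model.embed(suffix)], dim=1)
    logits = model(inp)
    # Each suffix token is predicted from the position just before it.
    t0 = PROMPT_LEN + prefix.size(1)
    pred = logits[:, t0 - 1 : t0 - 1 + suffix.size(1), :]
    loss = loss_fn(pred.reshape(-1, VOCAB), suffix.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

In the real attack the optimized soft prompt is then paired with greedy or consistency-based decoding over candidate prefixes; here the point is only that gradients flow exclusively into the prepended embeddings while the model itself is untouched.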
Key Contributions
- CoSPED framework with three novel loss functions (Dynamic Loss, Additive Loss, Common Loss) and a Self-Consistency Decoding strategy to improve the consistency and extraction rate of soft prompt-based training data extraction attacks on LLMs
- Comprehensive cross-model evaluation on GPT-Neo and Pythia series, exploring 16 loss combinations and achieving 65.2% extraction rate at 50-token prefix comparison
- Defense evaluation using Rank-One Model Editing (ROME) that reduces extraction rate from 65.2% to 1.6%, demonstrating how extraction analysis can directly inform targeted mitigations
🛡️ Threat Analysis
CoSPED is a white-box training data extraction attack: an adversary optimizes soft prompts with novel loss functions (Dynamic, Additive, Common Loss) and a Self-Consistency Decoding strategy to reconstruct verbatim memorized content from LLMs — a direct model inversion / training data reconstruction attack.
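The ROME defense works by applying a rank-one update to a feed-forward weight matrix so that the "key" activation associated with a memorized fact maps to a new "value", overwriting the leaked association without retraining. The following linear-algebra sketch assumes an identity key covariance for simplicity (real ROME estimates a key covariance from corpus statistics); the dimensions and vectors are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 6

W = rng.normal(size=(d_out, d_in))   # frozen MLP projection weight
k = rng.normal(size=d_in)            # key: activation pattern selecting the memorized fact
v_new = rng.normal(size=d_out)       # new value: output to produce instead of the leak

# Rank-one edit: W' = W + (v_new - W k) k^T / (k^T k).
# Then W' k = v_new exactly, while directions orthogonal to k are untouched.
W_edited = W + np.outer(v_new - W @ k, k) / (k @ k)

print(np.allclose(W_edited @ k, v_new))   # the edited key now yields v_new

# Inputs orthogonal to the key pass through unchanged:
q = rng.normal(size=d_in)
q -= (q @ k) / (k @ k) * k                # project out the key direction
print(np.allclose(W_edited @ q, W @ q))
```

This locality is what makes the defense attractive here: the edit surgically removes the targeted memorized association while leaving the model's behavior on unrelated inputs essentially intact, which is consistent with the large drop in extraction rate reported above.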