
CoSPED: Consistent Soft Prompt Targeted Data Extraction and Defense

Zhuochen Yang 1,2, Kar Wai Fok 2, Vrizlynn L. L. Thing 2

0 citations · 45 references · arXiv


Published on arXiv: 2510.11137

Model Inversion Attack

OWASP ML Top 10 — ML03

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

CoSPED achieves 65.2% training data extraction rate at 50-token prefix comparison on GPT-Neo, outperforming prior work (CLM, Ethicist); ROME-based defense reduces extraction rate to 1.6%.

CoSPED

Novel technique introduced


Large language models have gained widespread attention recently, but their potential security vulnerabilities, especially privacy leakage, are also becoming apparent. To test and evaluate data extraction risks in LLMs, we propose CoSPED, short for Consistent Soft Prompt targeted data Extraction and Defense. We introduce and test several novel components, including Dynamic Loss, Additive Loss, Common Loss, and a Self-Consistency Decoding strategy, to enhance the consistency of the soft prompt tuning process. Through extensive experiments with various combinations, we achieve an extraction rate of 65.2% at a 50-token prefix comparison. Comparisons of CoSPED with prior reference works confirm its superior extraction rates. We further evaluate CoSPED in additional scenarios, achieving a 51.7% extraction rate on the Pythia model and introducing a cross-model comparison. Finally, we explore defense through Rank-One Model Editing and reduce the extraction rate to 1.6%, which shows that our analysis of extraction mechanisms can directly inform effective mitigation strategies against soft prompt-based attacks.


Key Contributions

  • CoSPED framework with three novel loss functions (Dynamic Loss, Additive Loss, Common Loss) and Self-Consistency Decoding strategy to improve consistency and extraction rate of soft prompt-based training data extraction attacks on LLMs
  • Comprehensive cross-model evaluation on GPT-Neo and Pythia series, exploring 16 loss combinations and achieving 65.2% extraction rate at 50-token prefix comparison
  • Defense evaluation using Rank-One Model Editing (ROME) that reduces extraction rate from 65.2% to 1.6%, demonstrating how extraction analysis can directly inform targeted mitigations
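The ROME defense named above rests on a rank-one weight update. As a rough illustration (hypothetical code, not the paper's implementation, and simplified from the full ROME closed-form solution, which also incorporates a key covariance statistic), the sketch below applies the minimal rank-one edit that forces a linear layer `W` to map a chosen key vector `k` to a new value `v`, while leaving directions orthogonal to `k` untouched:

```python
# Simplified rank-one model edit (illustrative; ROME's actual update also
# uses a covariance matrix of keys). We choose delta so that W' @ k == v
# exactly, with W' = W + outer(delta, k) / (k . k).
def rank_one_edit(W, k, v):
    Wk = [sum(w * x for w, x in zip(row, k)) for row in W]   # current W @ k
    kk = sum(x * x for x in k)                               # ||k||^2
    delta = [(vi - wki) / kk for vi, wki in zip(v, Wk)]
    return [[W[i][j] + delta[i] * k[j] for j in range(len(k))]
            for i in range(len(W))]

# Demo: remap key [1, 0] to value [0, 2] in an identity layer.
W = [[1.0, 0.0], [0.0, 1.0]]
W_edited = rank_one_edit(W, k=[1.0, 0.0], v=[0.0, 2.0])
print([sum(w * x for w, x in zip(row, [1.0, 0.0])) for row in W_edited])
```

Because the update is rank-one along `k`, any key orthogonal to `k` (here `[0, 1]`) still produces its original output, which is the property that lets ROME suppress a specific memorized association without broadly damaging the model.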

🛡️ Threat Analysis

Model Inversion Attack

CoSPED is a white-box training data extraction attack: an adversary optimizes soft prompts with novel loss functions (Dynamic, Additive, Common Loss) and a Self-Consistency Decoding strategy to reconstruct verbatim memorized content from LLMs — a direct model inversion / training data reconstruction attack.
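The mechanism behind such soft-prompt attacks can be illustrated with a toy sketch (hypothetical code, not from the paper): a frozen linear "model" maps a continuous prompt embedding to token logits, and gradient descent tunes only the prompt until the model emits a chosen target token. This mirrors how CoSPED optimizes prompt embeddings in the white-box setting while the LLM's weights stay fixed:

```python
import math
import random

random.seed(0)
VOCAB, DIM = 12, 6
# Frozen "model": a single linear layer from prompt embedding to vocab logits.
W = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]
prompt = [0.0] * DIM   # trainable soft prompt embedding
target = 3             # token we want the frozen model to produce

def logits(p):
    return [sum(w * x for w, x in zip(row, p)) for row in W]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def step(p, lr=0.1):
    """One gradient-descent step on cross-entropy w.r.t. the prompt only."""
    probs = softmax(logits(p))
    loss = -math.log(probs[target])
    dlog = probs[:]                 # d(loss)/d(logits) = probs - onehot(target)
    dlog[target] -= 1.0
    grad = [sum(W[i][d] * dlog[i] for i in range(VOCAB)) for d in range(DIM)]
    return loss, [x - lr * g for x, g in zip(p, grad)]

first_loss, _ = step(prompt)
for _ in range(500):
    loss, prompt = step(prompt)     # only the prompt is updated; W is frozen

best = max(range(VOCAB), key=lambda i: logits(prompt)[i])
print(f"loss {first_loss:.3f} -> {loss:.3f}, predicted token {best}")
```

The real attack optimizes a sequence of prompt embeddings against an LLM so that, given a known prefix, decoding reproduces the memorized suffix verbatim; CoSPED's loss variants and Self-Consistency Decoding aim to make that optimization converge to the same extraction across runs.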


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time, targeted
Datasets
The Pile (implied via GPT-Neo/Pythia training data), GPT-Neo benchmark, Pythia benchmark
Applications
LLM training data extraction, privacy leakage evaluation in large language models