GEP: A GCG-Based method for extracting personally identifiable information from chatbots built on small language models
Published on arXiv: 2509.21192
Model Inversion Attack
OWASP ML Top 10 — ML03
Sensitive Information Disclosure
OWASP LLM Top 10 — LLM06
Key Finding
GEP extracts up to 60× more PII from ChatBioGPT than template-based attacks, and still reveals a leakage rate of up to 4.53% when PII is inserted in free-style syntactic expressions rather than fixed templates.
GEP (GCG-based PII Extraction)
Novel technique introduced
Small language models (SLMs) have become increasingly appealing because, in certain fields, they deliver performance approximately equivalent to that of large language models (LLMs) while consuming less energy and time during training and inference. However, the leakage of personally identifiable information (PII) from SLMs fine-tuned for downstream tasks has yet to be explored. In this study, we investigate PII leakage from an SLM-based chatbot. We first fine-tune a new chatbot, ChatBioGPT, on the BioGPT backbone using the medical datasets Alpaca and HealthCareMagic; it achieves BERTScore performance comparable to previous studies of ChatDoctor and ChatGPT. Using this model, we show that previous template-based PII attack methods cannot effectively extract the PII in the dataset for leakage detection under the SLM condition. We then propose GEP, a greedy coordinate gradient-based (GCG) method specifically designed for PII extraction. Our experiments with GEP show up to 60× more leakage than the previous template-based methods. We further extend GEP to a more complicated and realistic setting via free-style insertion, where PII is inserted into the dataset in various syntactic expressions rather than fixed templates, and GEP still reveals a PII leakage rate of up to 4.53%.
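As a concrete illustration of the leakage-rate metric reported above, the sketch below counts what fraction of inserted PII strings reappear verbatim in model generations. The PII records and outputs here are invented placeholders, not data from the paper, and exact-substring matching is one simple choice of leakage criterion.

```python
def leakage_rate(inserted_pii, generations):
    """Fraction of inserted PII items reproduced verbatim in any generation."""
    leaked = sum(
        any(pii in gen for gen in generations) for pii in inserted_pii
    )
    return leaked / len(inserted_pii)

# Invented example records (not from the paper's datasets):
pii = ["Alice Green, 555-0131", "Bob Hale, 555-0198"]
outputs = ["Patient Alice Green, 555-0131 reported chest pain."]
print(leakage_rate(pii, outputs))  # → 0.5
```

In the free-style insertion setting, a stricter matcher (e.g. normalizing whitespace or matching individual PII fields) would be needed, since the surrounding syntax varies.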
Key Contributions
- First study exploring PII leakage from SLM-based chatbots, demonstrating that template-based attacks are ineffective in this setting
- GEP attack using GCG-based gradient optimization achieving up to 60× more PII extraction than template-based methods
- Analysis of leakage rate vs. training steps, trigger token length, and PII position, providing insights for future defenses
🛡️ Threat Analysis
GEP is a training data extraction attack: an adversary uses greedy coordinate gradient optimization to recover PII that the model memorized during fine-tuning on medical datasets. This is a direct instance of recovering private training data from a model's outputs, matching the core definition of Model Inversion.
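The greedy coordinate search at the heart of a GCG-style attack can be sketched as follows. Everything here is a toy stand-in: the real attack back-propagates an LM's loss on the target PII string through the trigger's one-hot token embeddings, whereas this sketch uses a linear score `loss(X) = -sum(X * W)` over a one-hot token matrix, so the gradient at position `i` is simply `-W[i]`. The loop structure — rank single-token swaps by gradient, sample candidates, keep the best — is the part that mirrors GCG.

```python
import numpy as np

rng = np.random.default_rng(0)
V, T = 50, 8                   # toy vocab size, trigger length
W = rng.normal(size=(T, V))    # toy per-position token affinities

def loss(tokens):
    """Lower is better: negative affinity of the chosen tokens."""
    return -sum(W[i, t] for i, t in enumerate(tokens))

def gcg_step(tokens, k=5, n_candidates=16):
    """One greedy coordinate step: propose top-k single-token swaps
    suggested by the gradient, keep the best-scoring candidate."""
    top_k = np.argsort(-W, axis=1)[:, :k]         # gradient-ranked swaps
    best, best_loss = list(tokens), loss(tokens)
    for _ in range(n_candidates):
        cand = list(tokens)
        i = int(rng.integers(T))                  # random position
        cand[i] = int(top_k[i, rng.integers(k)])  # random top-k swap
        cand_loss = loss(cand)
        if cand_loss < best_loss:
            best, best_loss = cand, cand_loss
    return best

trigger0 = [int(t) for t in rng.integers(0, V, size=T)]
trigger = trigger0
for _ in range(50):            # loss is monotonically non-increasing
    trigger = gcg_step(trigger)
```

In the actual attack, `loss` would be the model's cross-entropy on the target PII continuation, and the optimized `trigger` tokens form the adversarial prompt that coaxes the chatbot into emitting memorized training data.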