GEP: A GCG-Based method for extracting personally identifiable information from chatbots built on small language models
Published on arXiv: 2509.21192
Model Inversion Attack
OWASP ML Top 10 — ML03
Sensitive Information Disclosure
OWASP LLM Top 10 — LLM06
Key Finding
GEP extracts up to 60× more PII from ChatBioGPT than template-based attacks, and still reveals a leakage rate of up to 4.53% when PII is inserted in free-style syntactic expressions rather than fixed templates.
GEP (GCG-based PII Extraction)
Novel technique introduced
Small language models (SLMs) have become increasingly appealing because, in certain fields, they deliver performance approximately equivalent to that of large language models (LLMs) while consuming less energy and time during training and inference. However, the leakage of personally identifiable information (PII) from SLMs fine-tuned for downstream tasks has yet to be explored. In this study, we investigate PII leakage from an SLM-based chatbot. We first fine-tune a new chatbot, ChatBioGPT, on the BioGPT backbone using the medical datasets Alpaca and HealthCareMagic; it achieves BERTScore performance comparable to previous studies of ChatDoctor and ChatGPT. Using this model, we show that previous template-based PII attack methods cannot effectively extract the PII in the dataset for leakage detection under the SLM condition. We then propose GEP, a greedy coordinate gradient-based (GCG) method specifically designed for PII extraction. Our experiments with GEP show up to 60× more leakage than the previous template-based methods. We further extend GEP to a more complicated and realistic setting via free-style insertion, where PII is inserted into the dataset in various syntactic expressions rather than fixed templates, and GEP still reveals a PII leakage rate of up to 4.53%.
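As a concrete illustration of the leakage-rate metric reported above, the sketch below counts what fraction of inserted PII strings reappear verbatim in model generations. The PII records and outputs here are invented placeholders, not data from the paper, and exact-substring matching is one simple choice of leakage criterion.

```python
def leakage_rate(inserted_pii, generations):
    """Fraction of inserted PII items reproduced verbatim in any generation."""
    leaked = sum(
        any(pii in gen for gen in generations) for pii in inserted_pii
    )
    return leaked / len(inserted_pii)

# Invented example records (not from the paper's datasets):
pii = ["Alice Green, 555-0131", "Bob Hale, 555-0198"]
outputs = ["Patient Alice Green, 555-0131 reported chest pain."]
print(leakage_rate(pii, outputs))  # → 0.5
```

In the free-style insertion setting, a stricter matcher (e.g. normalizing whitespace or matching individual PII fields) would be needed, since the surrounding syntax varies.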
Key Contributions
- First study exploring PII leakage from SLM-based chatbots, demonstrating that template-based attacks are ineffective in this setting
- GEP attack using GCG-based gradient optimization achieving up to 60× more PII extraction than template-based methods
- Analysis of leakage rate vs. training steps, trigger token length, and PII position, providing insights for future defenses
🛡️ Threat Analysis
GEP is a training data extraction attack: an adversary uses greedy coordinate gradient optimization to recover PII that the model memorized during fine-tuning on medical datasets. This is a direct instance of recovering private training data from a model's outputs, matching the core definition of Model Inversion.
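The greedy coordinate search at the heart of a GCG-style attack can be sketched as follows. Everything here is a toy stand-in: the real attack back-propagates an LM's loss on the target PII string through the trigger's one-hot token embeddings, whereas this sketch uses a linear score `loss(X) = -sum(X * W)` over a one-hot token matrix, so the gradient at position `i` is simply `-W[i]`. The loop structure — rank single-token swaps by gradient, sample candidates, keep the best — is the part that mirrors GCG.

```python
import numpy as np

rng = np.random.default_rng(0)
V, T = 50, 8                   # toy vocab size, trigger length
W = rng.normal(size=(T, V))    # toy per-position token affinities

def loss(tokens):
    """Lower is better: negative affinity of the chosen tokens."""
    return -sum(W[i, t] for i, t in enumerate(tokens))

def gcg_step(tokens, k=5, n_candidates=16):
    """One greedy coordinate step: propose top-k single-token swaps
    suggested by the gradient, keep the best-scoring candidate."""
    top_k = np.argsort(-W, axis=1)[:, :k]         # gradient-ranked swaps
    best, best_loss = list(tokens), loss(tokens)
    for _ in range(n_candidates):
        cand = list(tokens)
        i = int(rng.integers(T))                  # random position
        cand[i] = int(top_k[i, rng.integers(k)])  # random top-k swap
        cand_loss = loss(cand)
        if cand_loss < best_loss:
            best, best_loss = cand, cand_loss
    return best

trigger0 = [int(t) for t in rng.integers(0, V, size=T)]
trigger = trigger0
for _ in range(50):            # loss is monotonically non-increasing
    trigger = gcg_step(trigger)
```

In the actual attack, `loss` would be the model's cross-entropy on the target PII continuation, and the optimized `trigger` tokens form the adversarial prompt that coaxes the chatbot into emitting memorized training data.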