
CTIGuardian: A Few-Shot Framework for Mitigating Privacy Leakage in Fine-Tuned LLMs

Shashie Dilhara Batan Arachchige , Benjamin Zi Hao Zhao , Hassan Jameel Asghar , Dinusha Vatsalan , Dali Kaafar

0 citations · 69 references · Annual Computer Security Appli...


Published on arXiv: 2512.12914

Model Inversion Attack (OWASP ML Top 10 — ML03)

Sensitive Information Disclosure (OWASP LLM Top 10 — LLM06)

Key Finding

CTIGuardian achieves a better privacy-utility trade-off than NER-based Presidio baseline when defending GPT-4o mini and Mistral-7B Instruct models fine-tuned on sensitive CTI data against data-extraction attacks.

Novel technique introduced: CTIGuardian (privacy alignment)


Large Language Models (LLMs) are often fine-tuned to adapt their general-purpose knowledge to specific tasks and domains such as cyber threat intelligence (CTI). Fine-tuning is mostly done on proprietary datasets that may contain sensitive information, and owners expect their fine-tuned models not to inadvertently leak this information to potentially adversarial end users. Using CTI as a use case, we demonstrate that data-extraction attacks can recover sensitive information from models fine-tuned on CTI reports, underscoring the need for mitigation. Retraining the full model to eliminate this leakage is computationally expensive and impractical. We propose an alternative approach, which we call privacy alignment, inspired by safety alignment in LLMs. Just as safety alignment teaches the model to abide by safety constraints through a few examples, we enforce privacy alignment through few-shot supervision, integrating a privacy classifier and a privacy redactor, both handled by the same underlying LLM. We evaluate our system, called CTIGuardian, using GPT-4o mini and Mistral-7B Instruct models, benchmarking against Presidio, a named entity recognition (NER) baseline. Results show that CTIGuardian provides a better privacy-utility trade-off than NER-based models. While we demonstrate its effectiveness on a CTI use case, the framework is generic enough to be applicable to other sensitive domains.


Key Contributions

  • Construction of a new QA-style CTI dataset from APT reports with naturally occurring sensitive entities (IOCs, CVEs, CWEs) for ground-truth privacy leakage evaluation
  • Demonstration that prefix-based data-extraction attacks can recover sensitive training data from fine-tuned CTI LLMs
  • CTIGuardian: a few-shot 'privacy alignment' framework integrating an LLM-based privacy classifier and redactor, achieving a better privacy-utility trade-off than NER-based baselines like Presidio
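The classifier-plus-redactor idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the prompt wording, the few-shot examples, and the `llm` callable are all assumptions — in CTIGuardian both stages are served by the same underlying fine-tuned LLM, which is stubbed out here.

```python
# Sketch of few-shot "privacy alignment": one LLM plays two roles.
# Stage 1 classifies a draft answer as SENSITIVE or SAFE; stage 2
# redacts sensitive entities. Prompts below are illustrative only.
CLASSIFIER_PROMPT = """You are a privacy classifier for CTI text.
Examples:
Text: "The malware beaconed to 198.51.100.4" -> SENSITIVE
Text: "APT groups often use spear phishing" -> SAFE
Text: "{text} " ->"""

REDACTOR_PROMPT = """Replace sensitive entities (IPs, hashes, emails, CVEs)
with [REDACTED], keeping the rest of the text intact.
Example:
In: "Contact ops@evil.example, hash d41d8cd98f00b204e9800998ecf8427e"
Out: "Contact [REDACTED], hash [REDACTED]"
In: "{text}"
Out:"""

def guard(llm, draft_answer):
    """Return the draft unchanged if classified SAFE, else a redacted version.

    `llm` is any callable mapping a prompt string to a completion string;
    both stages reuse the same callable, mirroring the shared-LLM design.
    """
    verdict = llm(CLASSIFIER_PROMPT.format(text=draft_answer)).strip()
    if verdict == "SAFE":
        return draft_answer
    return llm(REDACTOR_PROMPT.format(text=draft_answer)).strip()
```

Because the guard wraps the model's output rather than its weights, it avoids the full retraining the abstract rules out as impractical.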

🛡️ Threat Analysis

Model Inversion Attack

The core threat is an adversary recovering sensitive training data (IOCs, file hashes, IP addresses, emails) from a fine-tuned LLM via prefix-based data extraction attacks — a direct model inversion / training data reconstruction attack scenario with a concrete adversarial threat model.
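A prefix-based extraction probe of this kind can be sketched as follows. This is a hedged illustration of the attack pattern, not the paper's harness: the `query_model` callable and the specific IOC regexes are assumptions, and real evaluations would match found entities against ground-truth training data.

```python
import re

# Regexes for common indicator-of-compromise (IOC) types an attacker
# might hunt for in model completions. Patterns are illustrative.
IOC_PATTERNS = {
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def find_iocs(text):
    """Scan a completion for IOC-shaped strings, keyed by pattern name."""
    hits = {name: pat.findall(text) for name, pat in IOC_PATTERNS.items()}
    return {name: found for name, found in hits.items() if found}

def extraction_probe(query_model, prefixes):
    """Feed training-report prefixes to the model; flag leaked IOCs.

    `query_model` is any callable mapping a prompt prefix to a completion
    string (e.g. a wrapper around a fine-tuned model's API).
    """
    leaks = {}
    for prefix in prefixes:
        completion = query_model(prefix)
        found = find_iocs(completion)
        if found:
            leaks[prefix] = found
    return leaks
```

The attack needs only black-box query access, which is why the threat model above applies to any fine-tuned model exposed to end users.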


Details

Domains: nlp
Model Types: llm, transformer
Threat Tags: black_box, inference_time, training_time
Datasets: Custom CTI QA dataset from APT reports (CVE/CWE-mapped)
Applications: cyber threat intelligence, fine-tuned LLM deployment, enterprise knowledge bases