Ensemble Privacy Defense for Knowledge-Intensive LLMs against Membership Inference Attacks
Haowei Fu 1, Bo Ni 1, Han Xu 2, Kunpeng Liu 3, Dan Lin 1, Tyler Derr 1
Published on arXiv (arXiv:2512.03100)
Membership Inference Attack
OWASP ML Top 10 — ML04
Key Finding
EPD reduces MIA success rate by up to 27.8% for SFT-based and 526.3% for RAG-based LLMs compared to the inference-time baseline, while maintaining answer quality.
EPD (Ensemble Privacy Defense)
Novel technique introduced
Retrieval-Augmented Generation (RAG) and Supervised Finetuning (SFT) have become the predominant paradigms for equipping Large Language Models (LLMs) with external knowledge for diverse, knowledge-intensive tasks. However, while such knowledge injection improves performance, it also exposes new attack surfaces. Membership Inference Attacks (MIAs), which aim to determine whether a given data sample was included in a model's training set, pose serious threats to privacy and trust in sensitive domains. To this end, we first systematically evaluate the vulnerability of RAG- and SFT-based LLMs to various MIAs. Then, to address the privacy risk, we further introduce a novel, model-agnostic defense framework, Ensemble Privacy Defense (EPD), which aggregates and evaluates the outputs of a knowledge-injected LLM, a base LLM, and a dedicated judge model to enhance resistance against MIAs. Comprehensive experiments show that, on average, EPD reduces MIA success by up to 27.8% for SFT and 526.3% for RAG compared to the inference-time baseline, while maintaining answer quality.
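The ensemble idea in the abstract can be illustrated with a minimal sketch. The paper's actual aggregation rule, prompts, and judge model are not reproduced here; all three model functions below are hypothetical stubs standing in for the knowledge-injected LLM, the base LLM, and the judge.

```python
# Minimal sketch of an ensemble privacy defense, NOT the paper's exact
# EPD algorithm. All model functions are hypothetical stubs.

def knowledge_llm(query: str) -> str:
    # Stand-in for the knowledge-injected (RAG/SFT) model: accurate,
    # but may leak verbatim fragments of member documents.
    return "Paris, as stated verbatim in internal doc #42."

def base_llm(query: str) -> str:
    # Stand-in for the base model with no injected knowledge.
    return "Paris."

def judge(query: str, candidate: str) -> float:
    # Hypothetical judge: rewards on-topic answers and penalizes
    # candidates containing corpus-specific artifacts that an MIA
    # could use as a membership signal.
    score = 1.0 if "Paris" in candidate else 0.0
    if "verbatim" in candidate or "#" in candidate:
        score -= 0.5  # privacy penalty for apparent leakage
    return score

def epd_answer(query: str) -> str:
    # Ensemble step: collect candidates from both models and let the
    # judge select the one that keeps quality while resisting MIAs.
    candidates = [knowledge_llm(query), base_llm(query)]
    return max(candidates, key=lambda c: judge(query, c))

print(epd_answer("What is the capital of France?"))  # -> Paris.
```

The key property the sketch captures is that the defense is model-agnostic: it only consumes the generated outputs, so it works without access to model weights or training data.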
Key Contributions
- Systematic empirical evaluation of MIA vulnerability in both RAG-based and SFT-based LLMs across multiple attack variants
- Ensemble Privacy Defense (EPD): a model-agnostic framework that aggregates outputs from a knowledge-injected LLM, a base LLM, and a judge model to resist membership inference
- Demonstrates EPD reduces MIA success by up to 27.8% for SFT and 526.3% for RAG over inference-time baseline while preserving answer quality
🛡️ Threat Analysis
The paper's core focus is membership inference attacks — determining whether a given data sample was included in an LLM's training set (SFT) or RAG corpus. Both the attack evaluation and the EPD defense framework directly target this binary membership inference threat.
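The binary membership decision can be sketched with a classic score-threshold attack. This is a generic illustration of MIAs, not an attack variant from the paper: the per-sample losses below are hypothetical values standing in for what an attacker would obtain by querying the target model.

```python
# Toy score-based membership inference attack. In a real attack the
# per-sample loss (or perplexity) comes from querying the target model;
# here it is a hypothetical lookup table. Low loss on a sample suggests
# the model memorized it during training/ingestion.
model_loss = {
    "doc A": 0.12,  # member: memorized, so low loss
    "doc B": 0.08,  # member
    "doc Z": 2.30,  # non-member: never seen, so high loss
    "doc Q": 1.90,  # non-member
}

def infer_membership(sample: str, threshold: float = 1.0) -> bool:
    # Threshold rule: predict "member" when the model's loss on the
    # sample falls below a calibrated cutoff.
    return model_loss[sample] < threshold

for doc in sorted(model_loss):
    print(doc, "-> member" if infer_membership(doc) else "-> non-member")
```

EPD's goal, in these terms, is to flatten the gap between member and non-member scores observable from the model's outputs, so the attacker's threshold rule degrades toward random guessing.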