
Ensemble Privacy Defense for Knowledge-Intensive LLMs against Membership Inference Attacks

Haowei Fu 1, Bo Ni 1, Han Xu 2, Kunpeng Liu 3, Dan Lin 1, Tyler Derr 1

0 citations · 32 references · arXiv


Published on arXiv · 2512.03100

Membership Inference Attack

OWASP ML Top 10 — ML04

Key Finding

EPD reduces MIA success rate by up to 27.8% for SFT-based and 526.3% for RAG-based LLMs compared to the inference-time baseline, while maintaining answer quality.

EPD (Ensemble Privacy Defense)

Novel technique introduced


Retrieval-Augmented Generation (RAG) and Supervised Finetuning (SFT) have become the predominant paradigms for equipping Large Language Models (LLMs) with external knowledge for diverse, knowledge-intensive tasks. However, while such knowledge injection improves performance, it also exposes new attack surfaces. Membership Inference Attacks (MIAs), which aim to determine whether a given data sample was included in a model's training set, pose serious threats to privacy and trust in sensitive domains. To this end, we first systematically evaluate the vulnerability of RAG- and SFT-based LLMs to various MIAs. Then, to address the privacy risk, we introduce a novel, model-agnostic defense framework, Ensemble Privacy Defense (EPD), which aggregates and evaluates the outputs of a knowledge-injected LLM, a base LLM, and a dedicated judge model to enhance resistance against MIAs. Comprehensive experiments show that, on average, EPD reduces MIA success by up to 27.8% for SFT and 526.3% for RAG compared to the inference-time baseline, while maintaining answer quality.


Key Contributions

  • Systematic empirical evaluation of MIA vulnerability in both RAG-based and SFT-based LLMs across multiple attack variants
  • Ensemble Privacy Defense (EPD): a model-agnostic framework that aggregates outputs from a knowledge-injected LLM, a base LLM, and a judge model to resist membership inference
  • Demonstrates EPD reduces MIA success by up to 27.8% for SFT and 526.3% for RAG over inference-time baseline while preserving answer quality
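The ensemble idea behind EPD can be sketched in a few lines. This is a hedged illustration, not the paper's exact aggregation rule: the function names (`epd_answer`, `judge`) and the specific judge-based selection policy are hypothetical, but they capture the described structure of comparing a knowledge-injected model's output against a base model's output and letting a judge decide what to release.

```python
# Hypothetical sketch of EPD-style ensembling (names and selection policy are
# assumptions, not the paper's exact algorithm). The intuition: when the base
# model's answer is judged comparably good, release it instead of the
# knowledge-injected model's answer, diluting membership signal from the
# injected knowledge.

def epd_answer(query, injected_llm, base_llm, judge):
    """Return a privacy-aware answer by ensembling two models via a judge."""
    injected_out = injected_llm(query)   # model with RAG/SFT knowledge
    base_out = base_llm(query)           # model without injected knowledge
    # The judge scores both candidates; prefer the base answer when it is
    # at least as good, so an attacker learns less about membership.
    if judge(query, base_out) >= judge(query, injected_out):
        return base_out
    return injected_out
```

Because the selection happens at inference time over model outputs, this kind of defense is model-agnostic: it needs no access to the target model's weights or training procedure.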

🛡️ Threat Analysis

Membership Inference Attack

The paper's core focus is membership inference attacks — determining whether a given data sample was included in an LLM's training set (SFT) or RAG corpus. Both the attack evaluation and the EPD defense framework directly target this binary membership inference threat.
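As a concrete illustration of the threat, a minimal loss-thresholding MIA can be sketched as follows. This is a standard baseline attack in the literature, not necessarily one of the paper's specific attack variants; the function name and threshold value are illustrative assumptions.

```python
# Minimal loss-thresholding membership inference sketch (a standard baseline
# attack; the paper evaluates multiple MIA variants, which may differ). A
# sample on which the target model has unusually low loss is guessed to be
# a training member, since models tend to fit training data more tightly.

def loss_threshold_mia(loss, threshold):
    """Predict membership: True means 'member' (low loss under the model)."""
    return loss < threshold

# Illustrative losses: members tend to score lower than non-members.
member_losses = [0.4, 0.6, 0.5]
nonmember_losses = [1.8, 2.1, 1.5]
preds = [loss_threshold_mia(l, 1.0)
         for l in member_losses + nonmember_losses]
```

Defenses like EPD aim to narrow exactly this gap between member and non-member behavior at inference time.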


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time
Applications
retrieval-augmented generation, supervised fine-tuning, knowledge-intensive llms