Defense · 2026

PrivMedChat: End-to-End Differentially Private RLHF for Medical Dialogue Systems

Sudip Bhujel


Published on arXiv (2603.03054)

Membership Inference Attack (OWASP ML Top 10 — ML04)

Sensitive Information Disclosure (OWASP LLM Top 10 — LLM06)

Key Finding

PrivMedChat provides formal differential privacy guarantees across the full RLHF pipeline, preventing high-confidence membership inference on rare patient records while maintaining utility and safety in medical dialogue tasks.

PrivMedChat (DP-RLHF)

Novel technique introduced


Large language models are increasingly used for patient-facing medical assistance and clinical decision support, but adapting them to clinical dialogue often requires supervision derived from doctor-patient conversations that may contain sensitive information. Conventional supervised fine-tuning and reinforcement learning from human feedback (RLHF) can amplify memorization, enabling membership inference and disclosure of rare training-set details. We present PrivMedChat (Private Medical Chat), an end-to-end differentially private RLHF (DP-RLHF) framework for medical dialogue systems. Our approach enforces differential privacy at each training stage that accesses dialogue-derived supervision, combining DP-SGD for supervised fine-tuning and for reward-model learning from preference pairs with DP-aware policy optimization for alignment. To avoid costly clinician labeling, we introduce an annotation-free preference construction strategy that pairs physician responses with filtered non-expert generations. We evaluate PrivMedChat across medical dialogue tasks and assess utility, safety, and privacy under consistent privacy accounting, thereby providing a practical pathway to align medical chatbots while offering formal privacy guarantees. We open-source our code at https://github.com/sudip-bhujel/privmedchat.
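The annotation-free preference construction described above can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the function names, data fields, and the length/duplicate filtering heuristic are assumptions.

```python
# Sketch of annotation-free preference-pair construction: each physician
# reply is treated as the "chosen" response and a filtered non-expert LLM
# generation as the "rejected" one. The filter below (minimum length,
# no verbatim copy of the expert answer) is a hypothetical heuristic.

def build_preference_pairs(dialogues, generate_nonexpert, min_len=20):
    """dialogues: list of dicts with 'prompt' and 'physician_response'.
    generate_nonexpert: callable prompt -> non-expert model response."""
    pairs = []
    for d in dialogues:
        candidate = generate_nonexpert(d["prompt"])
        # Drop degenerate generations: too short, or identical to the expert answer.
        if len(candidate.split()) < min_len:
            continue
        if candidate.strip() == d["physician_response"].strip():
            continue
        pairs.append({
            "prompt": d["prompt"],
            "chosen": d["physician_response"],  # expert answer is preferred
            "rejected": candidate,              # filtered non-expert generation
        })
    return pairs
```

The resulting (chosen, rejected) pairs can then feed a standard preference-based reward-model objective without any clinician labeling.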


Key Contributions

  • End-to-end DP-RLHF framework applying DP-SGD to SFT, reward model training, and PPO policy optimization for medical dialogue LLMs
  • Annotation-free preference construction strategy pairing physician responses with filtered non-expert LLM generations to avoid costly clinician labeling
  • Holistic privacy accounting across all training stages with empirical evaluation of utility, safety, and resistance to membership inference attacks
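The DP-SGD component applied at each stage follows the usual recipe: clip each per-example gradient to a fixed L2 norm, sum, and add Gaussian noise before the parameter update. The sketch below is a toy NumPy illustration with illustrative hyperparameter names, not the paper's implementation (which would operate on transformer parameters, typically via a library such as Opacus).

```python
# Minimal DP-SGD step (sketch): per-example gradients are clipped to L2
# norm `clip_norm`, summed, and Gaussian noise with standard deviation
# sigma * clip_norm is added before averaging into the update.
import numpy as np

def dp_sgd_step(params, per_example_grads, lr=0.1, clip_norm=1.0,
                sigma=1.0, rng=None):
    rng = rng or np.random.default_rng(0)
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Scale down any gradient whose norm exceeds clip_norm.
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    noisy_sum = np.sum(clipped, axis=0) + rng.normal(
        0.0, sigma * clip_norm, size=params.shape)
    return params - lr * noisy_sum / len(clipped)
```

Because each example's contribution to the noisy sum is bounded by `clip_norm`, a privacy accountant can track the cumulative (ε, δ) spent across SFT, reward-model training, and PPO, which is what "holistic privacy accounting" refers to.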

🛡️ Threat Analysis

Membership Inference Attack

The paper's stated adversarial threat is membership inference (MIA) against medical LLMs: the threat-model figure and keywords explicitly name MIA, and DP-RLHF is evaluated as a defense that prevents high-confidence membership inference on patients with rare symptoms.
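A standard baseline for this threat is the loss-threshold attack (not necessarily the paper's exact attack): records on which the model achieves unusually low loss are predicted to be training members, since memorized examples, especially rare ones, tend to be fit more tightly. A minimal sketch:

```python
# Loss-threshold membership inference (baseline sketch): predict "member"
# for any record whose model loss falls below a calibrated threshold.
# Differential privacy bounds how much any single record can shift the
# loss distribution, limiting the attacker's advantage.

def loss_threshold_mia(losses, threshold):
    """losses: dict mapping record_id -> per-record model loss.
    Returns the set of record ids predicted to be training members."""
    return {rid for rid, loss in losses.items() if loss < threshold}
```

In practice the threshold is calibrated on held-out non-member records, and attack strength is reported as advantage or AUC over the member/non-member split.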


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, black_box
Applications
medical dialogue systems, clinical decision support, patient-facing medical chatbots