Black-Box Behavioral Distillation Breaks Safety Alignment in Medical LLMs
Published on arXiv
2512.09403
Model Theft
OWASP ML Top 10 — ML05
Model Theft
OWASP LLM Top 10 — LLM10
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
A surrogate trained via benign-only black-box distillation produces unsafe completions for 86% of adversarial prompts, far exceeding Meditron-7B (66%) and the untuned base model (46%), demonstrating that task utility transfers while safety alignment collapses.
Black-Box Behavioral Distillation
Novel technique introduced
As medical large language models (LLMs) become increasingly integrated into clinical workflows, concerns around alignment robustness and safety are escalating. Prior work on model extraction has focused on classification models or memorization leakage, leaving the vulnerability of safety-aligned generative medical LLMs underexplored. We present a black-box distillation attack that replicates the domain-specific reasoning of safety-aligned medical LLMs using only output-level access. By issuing 48,000 instruction queries to Meditron-7B and collecting 25,000 benign instruction-response pairs, we fine-tune a LLaMA3 8B surrogate via parameter-efficient LoRA under a zero-alignment supervision setting, requiring no access to model weights, safety filters, or training data. At a cost of $12, the surrogate achieves strong fidelity on benign inputs while producing unsafe completions for 86% of adversarial prompts, far exceeding both Meditron-7B (66%) and the untuned base model (46%). This reveals a pronounced functional-ethical gap: task utility transfers while alignment collapses. To analyze this collapse, we develop a dynamic adversarial evaluation framework combining Generative Query (GQ)-based harmful prompt generation, verifier filtering, category-wise failure analysis, and adaptive Random Search (RS) jailbreak attacks. We also propose a layered defense system as a prototype detector for real-time alignment drift in black-box deployments. Our findings show that benign-only black-box distillation exposes a practical and under-recognized threat: adversaries can cheaply replicate medical LLM capabilities while stripping safety mechanisms, underscoring the need for extraction-aware safety monitoring.
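The benign-only distillation loop the abstract describes (query the teacher, keep a filtered subset of instruction-response pairs, then fine-tune a surrogate on them) can be sketched as follows. Here `query_teacher` is a hypothetical stand-in for output-level access to a deployed Meditron-7B endpoint, and the filtering heuristic is an assumption, not the authors' exact procedure:

```python
import json
import random

random.seed(0)  # deterministic sketch

def query_teacher(instruction: str) -> str:
    """Hypothetical stand-in for black-box, output-level access to the
    teacher model (e.g. a hosted Meditron-7B API). No weights, safety
    filters, or training data are visible to the attacker."""
    return f"[teacher completion for: {instruction}]"

def collect_distillation_pairs(instructions, keep_ratio=25_000 / 48_000):
    """Issue benign instruction queries and retain a filtered subset of
    instruction-response pairs (the paper keeps 25K of 48K queries)."""
    pairs = []
    for instr in instructions:
        response = query_teacher(instr)
        # Assumed quality filter: in practice this would drop refusals,
        # truncated outputs, and low-quality completions. Modeled here
        # as a random keep decision matching the paper's 25K/48K ratio.
        if random.random() < keep_ratio:
            pairs.append({"instruction": instr, "response": response})
    return pairs

queries = [f"Explain warning sign #{i} of sepsis." for i in range(100)]
dataset = collect_distillation_pairs(queries)
# Each record is now in a standard supervised fine-tuning format,
# ready for LoRA training of the LLaMA3 8B surrogate.
print(json.dumps(dataset[0], indent=2))
```

The key property is that every query is benign, so the teacher's safety layer never triggers; the surrogate therefore inherits task behavior but receives zero alignment supervision.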
Key Contributions
- Black-box behavioral distillation attack on a safety-aligned medical LLM (Meditron-7B) using only 48K output queries and LoRA fine-tuning of a LLaMA3 8B surrogate at a cost of $12, with no access to weights or safety filters
- Dynamic adversarial evaluation framework combining Generative Query (GQ)-based harmful prompt generation, verifier filtering, category-wise failure analysis, and adaptive Random Search jailbreak attacks to measure alignment collapse
- Prototype layered defense system for real-time alignment drift detection in black-box medical LLM deployments
🛡️ Threat Analysis
The core attack extracts the domain-specific capabilities of Meditron-7B by issuing 48,000 queries, collecting 25,000 instruction-response pairs, and fine-tuning a LLaMA3 8B surrogate via LoRA — classic black-box model extraction with zero access to weights, safety filters, or training data.
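The LoRA step is what makes the extraction cheap: the surrogate's base weights stay frozen and only a low-rank update is trained. A minimal numpy sketch of the update math (the dimensions and rank here are illustrative, not the paper's exact configuration):

```python
import numpy as np

d, k, r = 4096, 4096, 16      # illustrative LLaMA-style dims; LoRA rank r
alpha = 32                    # LoRA scaling hyperparameter

rng = np.random.default_rng(0)
W = rng.standard_normal((d, k))        # frozen base weight (never updated)
A = rng.standard_normal((r, k)) * 0.01 # trainable low-rank factor
B = np.zeros((d, r))                   # B init to zero, so W' == W at start

def effective_weight(W, B, A, alpha, r):
    """LoRA replaces W with W + (alpha / r) * B @ A in the forward pass;
    only A and B receive gradients."""
    return W + (alpha / r) * B @ A

# At initialization the surrogate behaves exactly like its base model.
assert np.allclose(effective_weight(W, B, A, alpha, r), W)

full_params = d * k
lora_params = r * (d + k)
print(f"trainable: {lora_params:,} vs full {full_params:,} "
      f"({lora_params / full_params:.2%})")
```

Training under 1% of the parameters per adapted matrix is what keeps the reported attack cost at $12-scale: the adversary buys task fidelity without ever paying for full fine-tuning, and none of the teacher's alignment behavior is in the supervision signal.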