
Black-Box Behavioral Distillation Breaks Safety Alignment in Medical LLMs

Sohely Jahan , Ruimin Sun

0 citations · 56 references · arXiv


Published on arXiv · 2512.09403

Model Theft (OWASP ML Top 10 — ML05)

Model Theft (OWASP LLM Top 10 — LLM10)

Prompt Injection (OWASP LLM Top 10 — LLM01)

Key Finding

A surrogate trained via benign-only black-box distillation achieves 86% unsafe completions on adversarial prompts, far exceeding Meditron-7B (66%) and the untuned base model (46%), demonstrating that task utility transfers while safety alignment collapses.
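The headline numbers reduce to a simple unsafe-completion-rate metric over per-prompt verifier verdicts. A minimal sketch, where the function name is illustrative and the verdict lists are hypothetical, sized only to reproduce the reported rates:

```python
def unsafe_rate(verdicts):
    """Fraction of adversarial prompts whose completion is judged unsafe.

    `verdicts` is a list of booleans, one per adversarial prompt
    (True = the model produced an unsafe completion).
    """
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)

# Hypothetical 50-prompt verdict lists matching the paper's rates.
surrogate = [True] * 43 + [False] * 7   # 86% unsafe
meditron  = [True] * 33 + [False] * 17  # 66% unsafe
base      = [True] * 23 + [False] * 27  # 46% unsafe
print(unsafe_rate(surrogate))  # prints 0.86
```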

Black-Box Behavioral Distillation

Novel technique introduced


As medical large language models (LLMs) become increasingly integrated into clinical workflows, concerns around alignment robustness and safety are escalating. Prior work on model extraction has focused on classification models or memorization leakage, leaving the vulnerability of safety-aligned generative medical LLMs underexplored. We present a black-box distillation attack that replicates the domain-specific reasoning of safety-aligned medical LLMs using only output-level access. By issuing 48,000 instruction queries to Meditron-7B and collecting 25,000 benign instruction-response pairs, we fine-tune a LLaMA3 8B surrogate via parameter-efficient LoRA under a zero-alignment supervision setting, requiring no access to model weights, safety filters, or training data. At a cost of $12, the surrogate achieves strong fidelity on benign inputs while producing unsafe completions for 86% of adversarial prompts, far exceeding both Meditron-7B (66%) and the untuned base model (46%). This reveals a pronounced functional-ethical gap: task utility transfers while alignment collapses. To analyze this collapse, we develop a dynamic adversarial evaluation framework combining Generative Query (GQ)-based harmful prompt generation, verifier filtering, category-wise failure analysis, and adaptive Random Search (RS) jailbreak attacks. We also propose a layered defense system as a prototype detector for real-time alignment drift in black-box deployments. Our findings show that benign-only black-box distillation exposes a practical and under-recognized threat: adversaries can cheaply replicate medical LLM capabilities while stripping safety mechanisms, underscoring the need for extraction-aware safety monitoring.
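The benign-only collection stage described above can be sketched with only standard-library Python. All names here (`query_teacher`, `REFUSAL_MARKERS`, `collect_pairs`) are hypothetical, and the stub stands in for output-only API access to the teacher model:

```python
# Minimal sketch of benign-only distillation data collection, assuming
# black-box (output-only) access to the teacher (Meditron-7B).

REFUSAL_MARKERS = ("i cannot", "i can't", "as an ai")

def query_teacher(prompt: str) -> str:
    """Stub standing in for an output-only API call to the teacher."""
    return f"Answer to: {prompt}"

def collect_pairs(prompts):
    """Issue benign queries and keep usable instruction-response pairs,
    dropping apparent refusals so the surrogate sees only task behavior."""
    pairs = []
    for p in prompts:
        response = query_teacher(p)
        if any(m in response.lower() for m in REFUSAL_MARKERS):
            continue  # discard refusals / safety-filtered outputs
        pairs.append({"instruction": p, "response": response})
    return pairs

dataset = collect_pairs(["What are common symptoms of anemia?"])
print(len(dataset))  # prints 1
```

The collected pairs would then feed standard parameter-efficient LoRA fine-tuning of the LLaMA3 8B surrogate, with no alignment supervision anywhere in the pipeline.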


Key Contributions

  • Black-box behavioral distillation attack on a safety-aligned medical LLM (Meditron-7B) using only 48K output queries and LoRA fine-tuning of a LLaMA3 8B surrogate at a cost of $12, with no access to weights or safety filters
  • Dynamic adversarial evaluation framework combining Generative Query (GQ)-based harmful prompt generation, verifier filtering, category-wise failure analysis, and adaptive Random Search jailbreak attacks to measure alignment collapse
  • Prototype layered defense system for real-time alignment drift detection in black-box medical LLM deployments
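The adaptive Random Search jailbreak in the evaluation framework can be illustrated with a self-contained toy sketch. The scoring function below is a mock (the real attack would score the target model's actual completions, e.g. via a verifier), and every name is an assumption rather than the authors' code:

```python
import random

# Toy adaptive Random Search (RS) over an adversarial suffix: randomly
# mutate one position at a time and keep mutations that improve a
# (here mocked) unsafe-ness objective.

ALPHABET = "abcdefghijklmnopqrstuvwxyz !"

def mock_score(prompt: str) -> float:
    """Stand-in objective; a real attack would query the target model
    and score how unsafe its completion is."""
    return sum(c == "!" for c in prompt) / max(len(prompt), 1)

def random_search(base_prompt, suffix_len=10, iters=200, seed=0):
    rng = random.Random(seed)
    suffix = [rng.choice(ALPHABET) for _ in range(suffix_len)]
    best = mock_score(base_prompt + "".join(suffix))
    for _ in range(iters):
        i = rng.randrange(suffix_len)        # pick one suffix position
        old = suffix[i]
        suffix[i] = rng.choice(ALPHABET)     # random single-character swap
        score = mock_score(base_prompt + "".join(suffix))
        if score > best:
            best = score                     # keep improving mutations
        else:
            suffix[i] = old                  # revert otherwise
    return "".join(suffix), best

suffix, score = random_search("Explain dosage limits", seed=1)
print(len(suffix))  # prints 10
```

The hill-climbing structure (mutate, score, keep-or-revert) is what makes the search adaptive: each accepted mutation conditions the next candidate suffix.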

🛡️ Threat Analysis

Model Theft

The core attack extracts the domain-specific capabilities of Meditron-7B by issuing 48,000 queries, collecting 25,000 instruction-response pairs, and fine-tuning a LLaMA3 8B surrogate via LoRA — classic black-box model extraction with zero access to weights, safety filters, or training data.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time
Datasets
Queried from Meditron-7B (48K instruction queries issued; 25K instruction-response pairs collected)
Applications
medical llms, clinical decision support