Black-Box Behavioral Distillation Breaks Safety Alignment in Medical LLMs
Published on arXiv
2512.09403
Model Theft
OWASP ML Top 10 — ML05
Model Theft
OWASP LLM Top 10 — LLM10
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
A surrogate trained via benign-only black-box distillation produces unsafe completions for 86% of adversarial prompts, far exceeding Meditron-7B (66%) and the untuned base model (46%), demonstrating that task utility transfers while safety alignment collapses.
Black-Box Behavioral Distillation
Novel technique introduced
As medical large language models (LLMs) become increasingly integrated into clinical workflows, concerns around alignment robustness and safety are escalating. Prior work on model extraction has focused on classification models or memorization leakage, leaving the vulnerability of safety-aligned generative medical LLMs underexplored. We present a black-box distillation attack that replicates the domain-specific reasoning of safety-aligned medical LLMs using only output-level access. By issuing 48,000 instruction queries to Meditron-7B and collecting 25,000 benign instruction-response pairs, we fine-tune a LLaMA3 8B surrogate via parameter-efficient LoRA under a zero-alignment supervision setting, requiring no access to model weights, safety filters, or training data. At a cost of $12, the surrogate achieves strong fidelity on benign inputs while producing unsafe completions for 86% of adversarial prompts, far exceeding both Meditron-7B (66%) and the untuned base model (46%). This reveals a pronounced functional-ethical gap: task utility transfers while alignment collapses. To analyze this collapse, we develop a dynamic adversarial evaluation framework combining Generative Query (GQ)-based harmful prompt generation, verifier filtering, category-wise failure analysis, and adaptive Random Search (RS) jailbreak attacks. We also propose a layered defense system as a prototype detector for real-time alignment drift in black-box deployments. Our findings show that benign-only black-box distillation exposes a practical and under-recognized threat: adversaries can cheaply replicate medical LLM capabilities while stripping safety mechanisms, underscoring the need for extraction-aware safety monitoring.
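The benign-only distillation loop the abstract describes (query the teacher, keep a filtered subset of instruction-response pairs, then fine-tune a surrogate on them) can be sketched as follows. Here `query_teacher` is a hypothetical stand-in for output-level access to a deployed Meditron-7B endpoint, and the filtering heuristic is an assumption, not the authors' exact procedure:

```python
import json
import random

random.seed(0)  # deterministic sketch

def query_teacher(instruction: str) -> str:
    """Hypothetical stand-in for black-box, output-level access to the
    teacher model (e.g. a hosted Meditron-7B API). No weights, safety
    filters, or training data are visible to the attacker."""
    return f"[teacher completion for: {instruction}]"

def collect_distillation_pairs(instructions, keep_ratio=25_000 / 48_000):
    """Issue benign instruction queries and retain a filtered subset of
    instruction-response pairs (the paper keeps 25K of 48K queries)."""
    pairs = []
    for instr in instructions:
        response = query_teacher(instr)
        # Assumed quality filter: in practice this would drop refusals,
        # truncated outputs, and low-quality completions. Modeled here
        # as a random keep decision matching the paper's 25K/48K ratio.
        if random.random() < keep_ratio:
            pairs.append({"instruction": instr, "response": response})
    return pairs

queries = [f"Explain warning sign #{i} of sepsis." for i in range(100)]
dataset = collect_distillation_pairs(queries)
# Each record is now in a standard supervised fine-tuning format,
# ready for LoRA training of the LLaMA3 8B surrogate.
print(json.dumps(dataset[0], indent=2))
```

The key property is that every query is benign, so the teacher's safety layer never triggers; the surrogate therefore inherits task behavior but receives zero alignment supervision.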
Key Contributions
- Black-box behavioral distillation attack on a safety-aligned medical LLM (Meditron-7B) using only 48K output queries and LoRA fine-tuning of a LLaMA3 8B surrogate at a cost of $12, with no access to weights or safety filters
- Dynamic adversarial evaluation framework combining Generative Query (GQ)-based harmful prompt generation, verifier filtering, category-wise failure analysis, and adaptive Random Search jailbreak attacks to measure alignment collapse
- Prototype layered defense system for real-time alignment drift detection in black-box medical LLM deployments
🛡️ Threat Analysis
The core attack extracts the domain-specific capabilities of Meditron-7B by issuing 48,000 queries, collecting 25,000 instruction-response pairs, and fine-tuning a LLaMA3 8B surrogate via LoRA — classic black-box model extraction with zero access to weights, safety filters, or training data.
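The LoRA step is what makes the extraction cheap: the surrogate's base weights stay frozen and only a low-rank update is trained. A minimal numpy sketch of the update math (the dimensions and rank here are illustrative, not the paper's exact configuration):

```python
import numpy as np

d, k, r = 4096, 4096, 16      # illustrative LLaMA-style dims; LoRA rank r
alpha = 32                    # LoRA scaling hyperparameter

rng = np.random.default_rng(0)
W = rng.standard_normal((d, k))        # frozen base weight (never updated)
A = rng.standard_normal((r, k)) * 0.01 # trainable low-rank factor
B = np.zeros((d, r))                   # B init to zero, so W' == W at start

def effective_weight(W, B, A, alpha, r):
    """LoRA replaces W with W + (alpha / r) * B @ A in the forward pass;
    only A and B receive gradients."""
    return W + (alpha / r) * B @ A

# At initialization the surrogate behaves exactly like its base model.
assert np.allclose(effective_weight(W, B, A, alpha, r), W)

full_params = d * k
lora_params = r * (d + k)
print(f"trainable: {lora_params:,} vs full {full_params:,} "
      f"({lora_params / full_params:.2%})")
```

Training under 1% of the parameters per adapted matrix is what keeps the reported attack cost at $12-scale: the adversary buys task fidelity without ever paying for full fine-tuning, and none of the teacher's alignment behavior is in the supervision signal.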