Attack (2025)

Clone What You Can't Steal: Black-Box LLM Replication via Logit Leakage and Distillation

Kanchon Gharami , Hansaka Aluvihare , Shafika Showkat Moni , Berker Peköz



Published on arXiv (arXiv:2509.00973)

Model Theft

OWASP ML Top 10 — ML05 · OWASP LLM Top 10 — LLM10

Key Finding

A 6-layer student model recovers 97.6% of the teacher's hidden-state geometry with a 7.31% perplexity increase using fewer than 10k black-box API queries and under 24 GPU-hours of compute.
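The perplexity and NLL figures are two views of the same quantity: perplexity is the exponential of mean per-token NLL. A quick sanity check, assuming the reported NLL is a mean per-token value in nats (the excerpt does not state the base or normalization):

```python
import math

# Perplexity is exp(mean NLL) when NLL is in nats per token
# (an assumption; the excerpt does not state base or normalization).
nll_student = 7.58
ppl_student = math.exp(nll_student)

# A 7.31% relative perplexity increase corresponds additively in NLL:
# the student sits log(1.0731) nats above the teacher under this reading.
nll_gap = math.log(1.0731)
nll_teacher = nll_student - nll_gap
assert math.isclose(math.exp(nll_teacher) * 1.0731, ppl_student, rel_tol=1e-9)
```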

SVD-Distillation LLM Clone Pipeline

Novel technique introduced


Large Language Models (LLMs) are increasingly deployed in mission-critical systems, facilitating tasks such as satellite operations, command-and-control, military decision support, and cyber defense. Many of these systems are accessed through application programming interfaces (APIs). When such APIs lack robust access controls, they can expose full or top-k logits, creating a significant and often overlooked attack surface. Prior art has mainly focused on reconstructing the output projection layer or distilling surface-level behaviors; regenerating a black-box model under tight query constraints remains underexplored. We address that gap by introducing a constrained replication pipeline that transforms partial logit leakage into a functional, deployable substitute clone. Our two-stage approach (i) reconstructs the output projection matrix by applying singular value decomposition (SVD) to top-k logits collected from fewer than 10k black-box queries, then (ii) distills the remaining architecture into compact student models of varying transformer depths, trained on an open-source dataset. A 6-layer student recreates 97.6% of the 6-layer teacher model's hidden-state geometry with only a 7.31% perplexity increase and a negative log-likelihood (NLL) of 7.58. A 4-layer variant achieves 17.1% faster inference and an 18.1% parameter reduction with comparable performance. The entire attack completes in under 24 graphics processing unit (GPU) hours and avoids triggering API rate-limit defenses. These results demonstrate how quickly a cost-limited adversary can clone an LLM, underscoring the urgent need for hardened inference APIs and secure on-premise defense deployments.
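The linear-algebra fact behind stage (i) can be sketched in a few lines: since logits are produced as `hidden @ W_out.T`, a matrix of collected logit rows has rank at most the hidden width, and its top right singular vectors span the column space of the output projection. The sketch below uses illustrative names and full logits; the paper's actual procedure (top-k truncation, any de-biasing) is not given in the excerpt:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, n_queries = 32, 500, 200

# Hypothetical teacher internals (illustrative, not the paper's setup):
# a leaky API exposes logits = hidden @ W_out.T for each query.
W_out = rng.standard_normal((vocab, d_model))   # projection to recover
hidden = rng.standard_normal((n_queries, d_model))
logits = hidden @ W_out.T                       # queries x vocab matrix

# SVD of the logit matrix; its numerical rank reveals the hidden width.
U, S, Vt = np.linalg.svd(logits, full_matrices=False)
rank = int((S > 1e-8 * S[0]).sum())             # equals d_model

# The top-d right singular vectors span W_out's column space, so the
# projection is recovered up to an invertible d x d linear transform.
W_hat = Vt[:rank].T                             # (vocab, d_model)
P = W_hat @ np.linalg.pinv(W_hat)               # projector onto span(W_hat)
residual = np.abs(P @ W_out - W_out).max()      # ~0: subspaces coincide
```

The recovered `W_hat` matches the true projection only up to an invertible linear transform; that ambiguity is absorbed by the distillation stage, which retrains the rest of the network around it.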


Key Contributions

  • Two-stage black-box LLM replication pipeline: (1) output projection matrix recovery via SVD over fewer than 10k top-k logit queries, then (2) knowledge distillation into compact student models of varying depths trained on open-source data
  • 6-layer student replicates 97.6% of teacher hidden-state geometry with only 7.31% perplexity increase; 4-layer variant achieves 17.1% faster inference and 18.1% parameter reduction at comparable fidelity
  • Full attack completes in under 24 GPU-hours and stays below API rate-limit thresholds, demonstrating a practical threat to mission-critical LLM deployments
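The distillation stage is described only at a high level in the excerpt. A common choice for logit-matching distillation, assumed here rather than confirmed by the paper, is the Hinton-style temperature-scaled KL divergence between teacher and student output distributions:

```python
import numpy as np

def softmax(x, T=1.0):
    # Temperature-scaled, numerically stable softmax over the last axis.
    z = x / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_kl(student_logits, teacher_logits, T=2.0):
    """Hinton-style soft-label loss: T^2 * mean KL(teacher || student)."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1)
    return float(kl.mean() * T * T)

rng = np.random.default_rng(0)
teacher = rng.standard_normal((8, 100))               # batch of logit rows
student = teacher + 0.1 * rng.standard_normal((8, 100))
loss_close = distill_kl(student, teacher)             # small but positive
loss_self = distill_kl(teacher, teacher)              # zero: perfect match
```

The `T * T` factor keeps gradient magnitudes comparable across temperatures; a hard-label cross-entropy term is often mixed in as well.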

🛡️ Threat Analysis

Model Theft

The primary contribution is a black-box model extraction attack that reconstructs the output projection matrix via SVD on leaked top-k logits and distills a functional, deployable clone — direct theft of LLM intellectual property.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time
Datasets
open-source text corpus (unspecified in excerpt)
Applications
llm inference apis, military decision support systems, command-and-control interfaces