Attack (2025)

Clone What You Can't Steal: Black-Box LLM Replication via Logit Leakage and Distillation

Kanchon Gharami , Hansaka Aluvihare , Shafika Showkat Moni , Berker Peköz



Published on arXiv (arXiv:2509.00973)

Model Theft

OWASP ML Top 10 — ML05 · OWASP LLM Top 10 — LLM10

Key Finding

A 6-layer student model recovers 97.6% of the teacher's hidden-state geometry with a 7.31% perplexity increase using fewer than 10k black-box API queries and under 24 GPU-hours of compute.
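The perplexity and NLL figures are two views of the same quantity: perplexity is the exponential of mean per-token NLL. A quick sanity check, assuming the reported NLL is a mean per-token value in nats (the excerpt does not state the base or normalization):

```python
import math

# Perplexity is exp(mean NLL) when NLL is in nats per token
# (an assumption; the excerpt does not state base or normalization).
nll_student = 7.58
ppl_student = math.exp(nll_student)

# A 7.31% relative perplexity increase corresponds additively in NLL:
# the student sits log(1.0731) nats above the teacher under this reading.
nll_gap = math.log(1.0731)
nll_teacher = nll_student - nll_gap
assert math.isclose(math.exp(nll_teacher) * 1.0731, ppl_student, rel_tol=1e-9)
```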

SVD-Distillation LLM Clone Pipeline

Novel technique introduced


Large Language Models (LLMs) are increasingly deployed in mission-critical systems, facilitating tasks such as satellite operations, command-and-control, military decision support, and cyber defense. Many of these systems are accessed through application programming interfaces (APIs). When such APIs lack robust access controls, they can expose full or top-k logits, creating a significant and often overlooked attack surface. Prior art has mainly focused on reconstructing the output projection layer or distilling surface-level behaviors; regenerating a black-box model under tight query constraints remains underexplored. We address that gap by introducing a constrained replication pipeline that transforms partial logit leakage into a functional, deployable substitute clone. Our two-stage approach (i) reconstructs the output projection matrix by applying singular value decomposition (SVD) to top-k logits collected from fewer than 10k black-box queries, then (ii) distills the remaining architecture into compact student models of varying transformer depths, trained on an open-source dataset. A 6-layer student recreates 97.6% of the 6-layer teacher model's hidden-state geometry with only a 7.31% perplexity increase and a negative log-likelihood (NLL) of 7.58. A 4-layer variant achieves 17.1% faster inference and an 18.1% parameter reduction with comparable performance. The entire attack completes in under 24 graphics processing unit (GPU) hours and avoids triggering API rate-limit defenses. These results demonstrate how quickly a cost-limited adversary can clone an LLM, underscoring the urgent need for hardened inference APIs and secure on-premise defense deployments.
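The linear-algebra fact behind stage (i) can be sketched in a few lines: since logits are produced as `hidden @ W_out.T`, a matrix of collected logit rows has rank at most the hidden width, and its top right singular vectors span the column space of the output projection. The sketch below uses illustrative names and full logits; the paper's actual procedure (top-k truncation, any de-biasing) is not given in the excerpt:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, n_queries = 32, 500, 200

# Hypothetical teacher internals (illustrative, not the paper's setup):
# a leaky API exposes logits = hidden @ W_out.T for each query.
W_out = rng.standard_normal((vocab, d_model))   # projection to recover
hidden = rng.standard_normal((n_queries, d_model))
logits = hidden @ W_out.T                       # queries x vocab matrix

# SVD of the logit matrix; its numerical rank reveals the hidden width.
U, S, Vt = np.linalg.svd(logits, full_matrices=False)
rank = int((S > 1e-8 * S[0]).sum())             # equals d_model

# The top-d right singular vectors span W_out's column space, so the
# projection is recovered up to an invertible d x d linear transform.
W_hat = Vt[:rank].T                             # (vocab, d_model)
P = W_hat @ np.linalg.pinv(W_hat)               # projector onto span(W_hat)
residual = np.abs(P @ W_out - W_out).max()      # ~0: subspaces coincide
```

The recovered `W_hat` matches the true projection only up to an invertible linear transform; that ambiguity is absorbed by the distillation stage, which retrains the rest of the network around it.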


Key Contributions

  • Two-stage black-box LLM replication pipeline: (1) output projection matrix recovery via SVD over fewer than 10k top-k logit queries, then (2) knowledge distillation into compact student models of varying depths trained on open-source data
  • 6-layer student replicates 97.6% of teacher hidden-state geometry with only 7.31% perplexity increase; 4-layer variant achieves 17.1% faster inference and 18.1% parameter reduction at comparable fidelity
  • Full attack completes in under 24 GPU-hours and stays below API rate-limit thresholds, demonstrating a practical threat to mission-critical LLM deployments
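The distillation stage is described only at a high level in the excerpt. A common choice for logit-matching distillation, assumed here rather than confirmed by the paper, is the Hinton-style temperature-scaled KL divergence between teacher and student output distributions:

```python
import numpy as np

def softmax(x, T=1.0):
    # Temperature-scaled, numerically stable softmax over the last axis.
    z = x / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_kl(student_logits, teacher_logits, T=2.0):
    """Hinton-style soft-label loss: T^2 * mean KL(teacher || student)."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1)
    return float(kl.mean() * T * T)

rng = np.random.default_rng(0)
teacher = rng.standard_normal((8, 100))               # batch of logit rows
student = teacher + 0.1 * rng.standard_normal((8, 100))
loss_close = distill_kl(student, teacher)             # small but positive
loss_self = distill_kl(teacher, teacher)              # zero: perfect match
```

The `T * T` factor keeps gradient magnitudes comparable across temperatures; a hard-label cross-entropy term is often mixed in as well.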

🛡️ Threat Analysis

Model Theft

The primary contribution is a black-box model extraction attack that reconstructs the output projection matrix via SVD on leaked top-k logits and distills a functional, deployable clone — direct theft of LLM intellectual property.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time
Datasets
open-source text corpus (unspecified in excerpt)
Applications
llm inference apis, military decision support systems, command-and-control interfaces