Latest papers

53 papers
attack arXiv Mar 25, 2026 · 14d ago

How Vulnerable Are Edge LLMs?

Ao Ding, Hongzong Li, Zi Liang et al. · China University of Geosciences · Hong Kong University of Science and Technology +4 more

Query-based extraction attack on quantized edge LLMs using clustered instruction queries to steal model behavior efficiently

Model Theft nlp
PDF
defense arXiv Mar 19, 2026 · 20d ago

Functional Subspace Watermarking for Large Language Models

Zikang Ding, Junhao Li, Suling Wu et al. · University of Electronic Science and Technology of China · Mohamed bin Zayed University of Artificial Intelligence +1 more

Embeds ownership watermarks in a low-dimensional functional subspace of LLM weights, surviving fine-tuning, quantization, and distillation attacks

Model Theft nlp
PDF
benchmark arXiv Mar 8, 2026 · 4w ago

DistillGuard: Evaluating Defenses Against LLM Knowledge Distillation

Bo Jiang · Temple University

Systematically evaluates nine output-level defenses against LLM distillation theft, finding most fail except chain-of-thought removal for math

Model Theft nlp
PDF
attack arXiv Mar 7, 2026 · 4w ago

How to Steal Reasoning Without Reasoning Traces

Tingwei Zhang, John X. Morris, Vitaly Shmatikov · Cornell Tech

Steals LLM reasoning capabilities by synthesizing hidden chains-of-thought from black-box answers and summaries alone

Model Theft nlp
PDF
defense arXiv Feb 21, 2026 · 6w ago

Echoes of Ownership: Adversarial-Guided Dual Injection for Copyright Protection in MLLMs

Chengwei Xia, Fan Ma, Ruijie Quan et al. · Lanzhou University +2 more

Adversarially-optimized trigger images that verify MLLM copyright by eliciting ownership text only in fine-tuned derivatives

Model Theft multimodal nlp
PDF
defense arXiv Feb 16, 2026 · 7w ago

Protecting Language Models Against Unauthorized Distillation through Trace Rewriting

Xinhang Ma, William Yeoh, Ning Zhang et al. · Washington University in St. Louis

Defends LLM APIs against unauthorized knowledge distillation by rewriting reasoning traces to degrade student training and embed watermarks

Model Theft nlp
PDF
attack arXiv Feb 11, 2026 · 8w ago

Vulnerabilities in Partial TEE-Shielded LLM Inference with Precomputed Noise

Abhishek Saini, Haolin Jiang, Hang Liu · Rutgers, The State University of New Jersey

Exploits key-reuse in TEE-shielded LLM inference to recover LLaMA-3 8B model weights in 6 minutes and bypass integrity checks

Model Theft nlp
PDF
defense arXiv Feb 10, 2026 · 8w ago

A Behavioral Fingerprint for Large Language Models: Provenance Tracking via Refusal Vectors

Zhenyu Xu, Victor S. Sheng · Texas Tech University

Fingerprints LLMs for provenance tracking using internal refusal vectors, achieving 100% accuracy across 76 derivative models

Model Theft nlp
PDF
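The refusal-vector idea above lends itself to a toy sketch (my own illustration, not the paper's actual method): treat the mean activation difference between refused and answered prompts as a model's "refusal direction," then compare directions across models by cosine similarity. All names, dimensions, and the 0.9 threshold below are made-up placeholders.

```python
import numpy as np

def refusal_vector(acts_refused, acts_complied):
    """Mean-difference 'refusal direction' in activation space."""
    return acts_refused.mean(axis=0) - acts_complied.mean(axis=0)

def fingerprint_match(vec_a, vec_b, threshold=0.9):
    """Cosine similarity between two models' refusal vectors."""
    cos = float(vec_a @ vec_b / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))
    return cos, cos >= threshold

# Toy check: a derivative model (small perturbation of the base vector)
# should match, while an unrelated model should not.
rng = np.random.default_rng(0)
base = refusal_vector(rng.normal(1.0, 0.1, (32, 64)),
                      rng.normal(0.0, 0.1, (32, 64)))
derived = base + rng.normal(0.0, 0.05, 64)    # stands in for a fine-tuned copy
unrelated = rng.normal(0.0, 1.0, 64)          # stands in for an independent model
print(fingerprint_match(base, derived)[1], fingerprint_match(base, unrelated)[1])
```

The point of the sketch is that a direction-based fingerprint survives small weight edits (the derivative stays near-parallel to the base) while independent models land near-orthogonal in high dimensions.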
defense arXiv Feb 9, 2026 · 8w ago

On Protecting Agentic Systems' Intellectual Property via Watermarking

Liwen Wang, Zongjie Li, Yuchong Xie et al. · The Hong Kong University of Science and Technology · HSBC

Watermarks agentic LLM systems by biasing tool execution paths, so stolen imitation models inherit detectable signatures

Model Theft nlp
PDF
defense arXiv Feb 3, 2026 · 9w ago

Towards Distillation-Resistant Large Language Models: An Information-Theoretic Perspective

Hao Fang, Tianyi Zhang, Tianqu Zhuang et al. · Tsinghua University · Harbin Institute of Technology

Defends proprietary LLMs from distillation-based theft by minimizing conditional mutual information in model logit outputs

Model Theft nlp
PDF
defense arXiv Feb 3, 2026 · 9w ago

Antidistillation Fingerprinting

Yixuan Even Xu, John Kirchenbauer, Yash Savani et al. · Carnegie Mellon University · University of Maryland

Fingerprints LLM outputs to detect unauthorized distillation using gradient-aligned token perturbations that transfer to student models

Model Theft nlp
PDF
defense arXiv Jan 30, 2026 · 9w ago

FNF: Functional Network Fingerprint for Large Language Models

Yiheng Liu, Junhao Ning, Sichen Xia et al. · Northwestern Polytechnical University · Shaanxi Normal University

Training-free LLM fingerprinting via functional network activation patterns detects unauthorized model derivatives across architectures and scales

Model Theft nlp
PDF Code
defense arXiv Jan 19, 2026 · 11w ago

KinGuard: Hierarchical Kinship-Aware Fingerprinting to Defend Against Large Language Model Stealing

Zhenhua Xu, Xiaoning Tian, Wenjun Zeng et al. · Zhejiang University · GenTel.io +4 more

Defends LLM IP by embedding kinship-narrative knowledge into model weights for stealthy, robust ownership verification

Model Theft nlp
PDF Code
defense arXiv Jan 13, 2026 · 12w ago

DNF: Dual-Layer Nested Fingerprinting for Large Language Model Intellectual Property Protection

Zhenhua Xu, Yiran Zhao, Mengting Zhong et al. · Zhejiang University · Binjiang Institute of Zhejiang University +3 more

Hierarchical backdoor fingerprinting embeds nested stylistic and semantic triggers in LLMs to prove ownership against black-box theft

Model Theft nlp
3 citations PDF Code
attack arXiv Jan 7, 2026 · Jan 2026

Inhibitory Attacks on Backdoor-based Fingerprinting for Large Language Models

Hang Fu, Wanli Peng, Yinghan Zhou et al. · China Agricultural University

Attacks backdoor-based LLM ownership fingerprinting in ensemble settings using token filtering and perplexity-based sentence verification

Model Theft nlp
PDF
attack arXiv Jan 3, 2026 · Jan 2026

Aggressive Compression Enables LLM Weight Theft

Davis Brown, Juan-Pablo Rivera, Dan Hendrycks et al. · University of Pennsylvania · Georgia Institute of Technology +1 more

Aggressive compression of LLM weights reduces datacenter exfiltration time from months to days, enabling practical weight theft attacks

Model Theft nlp
PDF
defense arXiv Dec 19, 2025 · Dec 2025

Key-Conditioned Orthonormal Transform Gating (K-OTG): Multi-Key Access Control with Hidden-State Scrambling for LoRA-Tuned Models

Muhammad Haris Khan · University of Copenhagen

Defends fine-tuned LLMs against unauthorized use via secret-key-conditioned orthonormal hidden-state scrambling in LoRA adapters

Model Theft nlp
PDF
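The core mechanism in the K-OTG summary, key-conditioned orthonormal scrambling of hidden states, can be sketched in a few lines (a minimal illustration under my own assumptions, not the paper's construction): derive a deterministic orthonormal matrix from a secret key via QR of a seeded Gaussian matrix; only the matching key inverts the scrambling exactly.

```python
import numpy as np

def key_to_orthonormal(key: int, dim: int) -> np.ndarray:
    """Derive a deterministic orthonormal matrix from a secret key
    (QR decomposition of a key-seeded Gaussian matrix)."""
    rng = np.random.default_rng(key)
    q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    return q

dim, key = 8, 0xC0FFEE
Q = key_to_orthonormal(key, dim)
h = np.arange(dim, dtype=float)                          # a hidden-state vector
scrambled = Q @ h                                        # served/stored form
recovered = key_to_orthonormal(key, dim).T @ scrambled   # correct key inverts exactly
wrong = key_to_orthonormal(key + 1, dim).T @ scrambled   # wrong key yields garbage
print(np.allclose(recovered, h), np.allclose(wrong, h))
```

Because Q is orthonormal, scrambling preserves norms and inner products, so downstream layers conditioned on the same key see an exact rotation rather than lossy noise.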
defense arXiv Dec 18, 2025 · Dec 2025

From Essence to Defense: Adaptive Semantic-aware Watermarking for Embedding-as-a-Service Copyright Protection

Hao Li, Yubing Ren, Yanan Cao et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences

Semantic-aware watermarking for EaaS embeddings using LSH to detect model imitation attacks while preserving embedding utility

Model Theft nlp
PDF
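The LSH component mentioned in the EaaS summary can be illustrated with a SimHash-style sketch (my own toy version, not the paper's scheme): the provider keeps secret random hyperplanes and signs embeddings against them; an imitator's near-copied embeddings reproduce the same bit signature with high agreement.

```python
import numpy as np

def lsh_signature(vec, hyperplanes):
    """SimHash-style signature: sign bits of projections onto secret hyperplanes."""
    return (hyperplanes @ vec > 0).astype(int)

rng = np.random.default_rng(42)
planes = rng.normal(size=(16, 128))               # provider's secret hyperplanes
emb = rng.normal(size=128)                        # an EaaS embedding
copied = emb + rng.normal(scale=0.01, size=128)   # imitator's near-copy
agreement = (lsh_signature(emb, planes) == lsh_signature(copied, planes)).mean()
print(agreement)
```

High bit agreement on queries the imitator never saw is what turns the signature into evidence of imitation while leaving the embedding itself essentially unchanged for legitimate users.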
attack arXiv Dec 10, 2025 · Dec 2025

Black-Box Behavioral Distillation Breaks Safety Alignment in Medical LLMs

Sohely Jahan, Ruimin Sun

Black-box distillation clones a medical LLM for $12, collapsing safety alignment and achieving 86% adversarial jailbreak success

Model Theft Prompt Injection nlp
PDF
defense arXiv Dec 3, 2025 · Dec 2025

SELF: A Robust Singular Value and Eigenvalue Approach for LLM Fingerprinting

Hanxiu Zhang, Yue Zheng · The Chinese University of Hong Kong

Fingerprints LLM weights via singular value decomposition to detect stolen models, resisting false claims and weight manipulation attacks

Model Theft nlp
1 citation PDF Code
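The singular-value idea behind the SELF summary can be sketched quickly (a hypothetical illustration under simplifying assumptions, not the paper's method): normalized top-k singular values of a weight matrix barely move under small fine-tuning edits, while a structurally different model (here, a contrived rank-deficient stand-in) has a visibly different spectrum.

```python
import numpy as np

def sv_fingerprint(weight, k=8):
    """Normalized top-k singular values as a cheap spectral fingerprint."""
    s = np.linalg.svd(weight, compute_uv=False)[:k]
    return s / s[0]

def spectra_distance(w_a, w_b, k=8):
    """Euclidean distance between two weight matrices' spectral fingerprints."""
    return float(np.linalg.norm(sv_fingerprint(w_a, k) - sv_fingerprint(w_b, k)))

rng = np.random.default_rng(1)
w = rng.normal(size=(64, 64))                                # "protected" weights
stolen = w + rng.normal(scale=0.01, size=(64, 64))           # lightly fine-tuned copy
other = rng.normal(size=(64, 8)) @ rng.normal(size=(8, 64))  # structurally different model
print(spectra_distance(w, stolen), spectra_distance(w, other))
```

Weyl's inequality bounds each singular value's shift by the spectral norm of the perturbation, which is why a small additive edit leaves the fingerprint nearly intact.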