Towards Building Non-Fine-Tunable Foundation Models
Ziyao Wang 1, Nizhang Li 2, Pingzhi Li 3, Guoheng Sun 1, Tianlong Chen 3, Ang Li 1
Published on arXiv
2602.00446
Transfer Learning Attack
OWASP ML Top 10 — ML07
Model Theft
OWASP ML Top 10 — ML05
Key Finding
PMP preserves base LLM performance while consistently degrading unauthorized fine-tuning across a wide range of downstream tasks, with resistance strength controlled by the mask sparsity ratio.
Private Mask Pre-Training (PMP)
Novel technique introduced
Open-sourcing foundation models (FMs) enables broad reuse but also exposes model trainers to economic and safety risks from unrestricted downstream fine-tuning. We address this problem by building non-fine-tunable foundation models: models that remain broadly usable in their released form while yielding limited adaptation gains under task-agnostic unauthorized fine-tuning. We propose Private Mask Pre-Training (PMP), a pre-training framework that concentrates representation learning into a sparse subnetwork identified early in training. The binary mask defining this subnetwork is kept private, and only the final dense weights are released. Unauthorized fine-tuning, lacking access to the mask, is therefore forced to update parameters misaligned with the pre-training subspace, inducing an intrinsic mismatch between the fine-tuning objective and the pre-training geometry. We provide a theoretical analysis showing that this mismatch destabilizes gradient-based adaptation and bounds fine-tuning gains. Empirical results on large language models demonstrate that PMP preserves base-model performance while consistently degrading unauthorized fine-tuning across a wide range of downstream tasks, with the strength of non-fine-tunability controlled by the mask ratio.
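The core mechanism can be illustrated with a minimal sketch: a fixed private binary mask gates which parameters receive gradient updates during pre-training, while the full dense weight tensor is what gets released. All names (`make_private_mask`, `pmp_step`) and the plain-SGD update are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_private_mask(shape, sparsity, rng):
    """Binary mask keeping roughly (1 - sparsity) of the parameters
    trainable; the mask is held by the trainer and never released."""
    return (rng.random(shape) >= sparsity).astype(np.float64)

def pmp_step(weights, grad, mask, lr=0.1):
    """One masked pre-training step: only the private subnetwork is
    updated; entries outside the mask stay frozen at their init values."""
    return weights - lr * (mask * grad)

# Toy weight matrix standing in for one layer of the foundation model.
w = rng.normal(size=(4, 4))
mask = make_private_mask(w.shape, sparsity=0.5, rng=rng)
g = rng.normal(size=w.shape)

w_new = pmp_step(w, g, mask)

# Parameters outside the private subnetwork are untouched by pre-training,
# yet the released tensor w_new is dense and reveals no mask structure.
assert np.allclose(w_new[mask == 0], w[mask == 0])
```

An unauthorized fine-tuner who receives only `w_new` cannot restrict updates to the subnetwork that actually carries the learned representations, which is the geometric mismatch the paper's analysis exploits.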
Key Contributions
- Private Mask Pre-Training (PMP): concentrates representation learning into a sparse subnetwork whose binary mask is kept secret, releasing only the dense weights to frustrate unauthorized fine-tuning
- Theoretical analysis showing that fine-tuning without the mask induces a geometric mismatch with the pre-training subspace that destabilizes gradient-based adaptation and bounds fine-tuning gains
- Empirical demonstration on LLMs that PMP preserves base model capability while consistently degrading unauthorized fine-tuning across diverse downstream tasks, with non-fine-tunability strength tunable via mask ratio
🛡️ Threat Analysis
Beyond safety, the paper is motivated by protecting the economic IP value of open-source models from unauthorized adaptation, analogous to anti-distillation and anti-cloning defenses. PMP functions as a structural deterrent against value extraction through unauthorized downstream specialization, a form of model IP protection.
PMP directly defends against the transfer-learning/fine-tuning attack: it renders gradient-based adaptation ineffective for parties lacking the private subnetwork mask by exploiting a geometric mismatch between fine-tuning updates and the pre-training subspace. The threat model is explicitly unauthorized fine-tuning (e.g., LoRA or full fine-tuning) of a released model.