Limits of Convergence-Rate Control for Open-Weight Safety
Domenic Rosati 1,2, Xijie Zeng 1,2, Hong Huang 1, Sebastian Dionicio 1, Subhabrata Majumdar 3, Frank Rudzicz 1,2, Hassan Sajjad 1
Published on arXiv: 2602.18868
Transfer Learning Attack
OWASP ML Top 10 — ML07
Key Finding
SpecDef provably slows harmful fine-tuning in non-adversarial settings, but an adversary with knowledge of the defense can restore fast convergence at only a linear cost in model size, showing that convergence-rate control alone is insufficient as a security guarantee.
SpecDef
Novel technique introduced
Open-weight foundation models can be fine-tuned for harmful purposes after release, yet no existing training-resistance methods provide theoretical guarantees. Treating these interventions as convergence-rate control problems lets us connect optimization speed to the spectral structure of model weights. We leverage this insight to develop a new understanding of convergence-rate control through spectral reparameterization and derive an algorithm, SpecDef, that both provably and empirically slows first- and second-order optimization in non-adversarial settings. In adversarial settings, we establish a fundamental limit on a broad class of convergence-rate control methods, including our own: an attacker with sufficient knowledge can restore fast convergence at a linear increase in model size. To overcome this limitation, future work will need to investigate methods that are not equivalent to controlling convergence rate.
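The link between optimization speed and spectral structure can be seen in a standard textbook setting: gradient descent on a quadratic converges at a rate governed by the Hessian's condition number, so squeezing part of the spectrum slows convergence dramatically. A minimal numpy sketch of that mechanism (illustrative only; this is not the paper's SpecDef algorithm, and the function name is hypothetical):

```python
import numpy as np

def gd_steps_to_converge(hessian_diag, lr, tol=1e-6, max_steps=100_000):
    """Run gradient descent on f(x) = 0.5 * x^T diag(h) x from x0 = ones,
    returning the number of steps until ||x|| < tol."""
    x = np.ones_like(hessian_diag)
    for step in range(max_steps):
        if np.linalg.norm(x) < tol:
            return step
        x = x - lr * hessian_diag * x  # gradient of the quadratic is diag(h) @ x
    return max_steps

# Well-conditioned spectrum: eigenvalues in [0.9, 1.0]
well = np.linspace(0.9, 1.0, 50)
# Ill-conditioned spectrum: one tiny eigenvalue dominates the convergence rate
ill = np.concatenate([np.linspace(0.9, 1.0, 49), [1e-3]])

# Step size 1/L (L = largest eigenvalue = 1.0) in both cases
fast = gd_steps_to_converge(well, lr=1.0)
slow = gd_steps_to_converge(ill, lr=1.0)
print(fast, slow)  # the ill-conditioned spectrum needs orders of magnitude more steps
```

The per-coordinate contraction factor is |1 - lr * lambda_i|, so the smallest eigenvalue sets the number of steps; a defense that flattens or squeezes the spectrum of the effective Hessian slows every first-order optimizer in this way.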
Key Contributions
- SpecDef algorithm: spectral reparameterization of model weights that provably slows first- and second-order optimization during harmful fine-tuning without degrading original model performance
- Theoretical framework linking convergence-rate control to the Hessian spectral structure, enabling tractable analysis of fine-tuning resistance for large foundation models
- Fundamental impossibility result: the entire class of convergence-rate control defenses can be defeated by an informed adversary at only linear cost in model size, establishing a ceiling for this approach
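The impossibility result's intuition can also be sketched in the same toy quadratic setting: if the defended loss landscape is ill-conditioned, an informed attacker can add one trainable scaling parameter per coordinate (a linear increase in parameter count) that acts as a diagonal preconditioner and restores fast convergence. An illustrative numpy sketch, assuming a diagonal Hessian; the function name and setup are hypothetical, not the paper's construction:

```python
import numpy as np

def gd_steps(h, lr, precond=None, tol=1e-6, max_steps=100_000):
    """Gradient descent on f(x) = 0.5 * x^T diag(h) x. With a diagonal
    reparameterization x = precond * z (the attacker's extra parameters),
    we optimize over z; the effective Hessian becomes diag(precond^2 * h)."""
    p = np.ones_like(h) if precond is None else precond
    h_eff = p * p * h  # Hessian seen by the optimizer in z-space
    z = np.ones_like(h)
    for step in range(max_steps):
        if np.linalg.norm(p * z) < tol:  # convergence measured in x-space
            return step
        z = z - lr * h_eff * z
    return max_steps

# Defender's squeezed spectrum: condition number ~ 1e3
h = np.concatenate([np.linspace(0.9, 1.0, 49), [1e-3]])

slow = gd_steps(h, lr=1.0)                          # defended model: slow
attack = gd_steps(h, lr=1.0, precond=1 / np.sqrt(h))  # + n extra params: fast
print(slow, attack)
```

With knowledge of the spectrum, the attacker's n extra scaling parameters make the effective Hessian the identity, so convergence is immediate; this is the linear-cost escape that caps what any convergence-rate control defense can guarantee.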
🛡️ Threat Analysis
The core threat is adversarial exploitation of fine-tuning and transfer learning to repurpose open-weight pre-trained models for harmful tasks (e.g., weapons, deepfakes). SpecDef resists this by manipulating the spectral structure of model weights to slow fine-tuning convergence, and the paper establishes fundamental limits for this entire class of fine-tuning-resistance defenses.