Limits of Convergence-Rate Control for Open-Weight Safety
Domenic Rosati 1,2, Xijie Zeng 1,2, Hong Huang 1, Sebastian Dionicio 1, Subhabrata Majumdar 3, Frank Rudzicz 1,2, Hassan Sajjad 1
Published on arXiv: 2602.18868
Transfer Learning Attack
OWASP ML Top 10 — ML07
Key Finding
SpecDef provably slows harmful fine-tuning in non-adversarial settings, but an adversary with knowledge of the defense can restore fast convergence at only a linear cost in model size, showing that convergence-rate control alone is insufficient as a security guarantee.
SpecDef
Novel technique introduced
Open-weight foundation models can be fine-tuned for harmful purposes after release, yet no existing training-resistance methods provide theoretical guarantees. Treating these interventions as convergence-rate control problems lets us connect optimization speed to the spectral structure of model weights. We leverage this insight to develop a new understanding of convergence-rate control through spectral reparameterization and derive an algorithm, SpecDef, that both provably and empirically slows first- and second-order optimization in non-adversarial settings. In adversarial settings, we establish a fundamental limit on a broad class of convergence-rate control methods, including our own: an attacker with sufficient knowledge can restore fast convergence at a linear increase in model size. To overcome this limitation, future work will need to investigate methods that are not equivalent to controlling convergence rate.
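The link between optimization speed and spectral structure can be seen in a standard textbook setting: gradient descent on a quadratic converges at a rate governed by the Hessian's condition number, so squeezing part of the spectrum slows convergence dramatically. A minimal numpy sketch of that mechanism (illustrative only; this is not the paper's SpecDef algorithm, and the function name is hypothetical):

```python
import numpy as np

def gd_steps_to_converge(hessian_diag, lr, tol=1e-6, max_steps=100_000):
    """Run gradient descent on f(x) = 0.5 * x^T diag(h) x from x0 = ones,
    returning the number of steps until ||x|| < tol."""
    x = np.ones_like(hessian_diag)
    for step in range(max_steps):
        if np.linalg.norm(x) < tol:
            return step
        x = x - lr * hessian_diag * x  # gradient of the quadratic is diag(h) @ x
    return max_steps

# Well-conditioned spectrum: eigenvalues in [0.9, 1.0]
well = np.linspace(0.9, 1.0, 50)
# Ill-conditioned spectrum: one tiny eigenvalue dominates the convergence rate
ill = np.concatenate([np.linspace(0.9, 1.0, 49), [1e-3]])

# Step size 1/L (L = largest eigenvalue = 1.0) in both cases
fast = gd_steps_to_converge(well, lr=1.0)
slow = gd_steps_to_converge(ill, lr=1.0)
print(fast, slow)  # the ill-conditioned spectrum needs orders of magnitude more steps
```

The per-coordinate contraction factor is |1 - lr * lambda_i|, so the smallest eigenvalue sets the number of steps; a defense that flattens or squeezes the spectrum of the effective Hessian slows every first-order optimizer in this way.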
Key Contributions
- SpecDef algorithm: spectral reparameterization of model weights that provably slows first- and second-order optimization during harmful fine-tuning without degrading original model performance
- Theoretical framework linking convergence-rate control to the Hessian spectral structure, enabling tractable analysis of fine-tuning resistance for large foundation models
- Fundamental impossibility result: the entire class of convergence-rate control defenses can be defeated by an informed adversary at only linear cost in model size, establishing a ceiling for this approach
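The impossibility result's intuition can also be sketched in the same toy quadratic setting: if the defended loss landscape is ill-conditioned, an informed attacker can add one trainable scaling parameter per coordinate (a linear increase in parameter count) that acts as a diagonal preconditioner and restores fast convergence. An illustrative numpy sketch, assuming a diagonal Hessian; the function name and setup are hypothetical, not the paper's construction:

```python
import numpy as np

def gd_steps(h, lr, precond=None, tol=1e-6, max_steps=100_000):
    """Gradient descent on f(x) = 0.5 * x^T diag(h) x. With a diagonal
    reparameterization x = precond * z (the attacker's extra parameters),
    we optimize over z; the effective Hessian becomes diag(precond^2 * h)."""
    p = np.ones_like(h) if precond is None else precond
    h_eff = p * p * h  # Hessian seen by the optimizer in z-space
    z = np.ones_like(h)
    for step in range(max_steps):
        if np.linalg.norm(p * z) < tol:  # convergence measured in x-space
            return step
        z = z - lr * h_eff * z
    return max_steps

# Defender's squeezed spectrum: condition number ~ 1e3
h = np.concatenate([np.linspace(0.9, 1.0, 49), [1e-3]])

slow = gd_steps(h, lr=1.0)                          # defended model: slow
attack = gd_steps(h, lr=1.0, precond=1 / np.sqrt(h))  # + n extra params: fast
print(slow, attack)
```

With knowledge of the spectrum, the attacker's n extra scaling parameters make the effective Hessian the identity, so convergence is immediate; this is the linear-cost escape that caps what any convergence-rate control defense can guarantee.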
🛡️ Threat Analysis
The core threat is adversarial exploitation of fine-tuning and transfer learning to repurpose open-weight pre-trained models for harmful tasks (e.g., weapons, deepfakes). SpecDef resists this by manipulating the spectral structure of model weights to slow fine-tuning convergence, and the paper establishes fundamental limits for this entire class of fine-tuning-resistance defenses.