defense 2026

Limits of Convergence-Rate Control for Open-Weight Safety

Domenic Rosati 1,2, Xijie Zeng 1,2, Hong Huang 1, Sebastian Dionicio 1, Subhabrata Majumdar 3, Frank Rudzicz 1,2, Hassan Sajjad 1

Published on arXiv (Cornell University) · arXiv:2602.18868

Transfer Learning Attack

OWASP ML Top 10 — ML07

Key Finding

SpecDef provably slows harmful fine-tuning in non-adversarial settings, but an adversary with knowledge of the model can restore fast convergence at only a linear cost in model size, showing that convergence-rate control is insufficient as a security guarantee.

SpecDef

Novel technique introduced


Open-weight foundation models can be fine-tuned for harmful purposes after release, yet no existing training-resistance method provides theoretical guarantees. Treating these interventions as convergence-rate control problems allows us to connect optimization speed to the spectral structure of model weights. We leverage this insight to develop a novel understanding of convergence-rate control through spectral reparameterization and derive an algorithm, SpecDef, that both provably and empirically slows first- and second-order optimization in non-adversarial settings. In adversarial settings, we establish a fundamental limit on a broad class of convergence-rate control methods, including our own: an attacker with sufficient knowledge can restore fast convergence at a linear increase in model size. To overcome this limitation, future work will need to investigate methods that are not equivalent to controlling convergence rate.


Key Contributions

  • SpecDef algorithm: spectral reparameterization of model weights that provably slows first- and second-order optimization during harmful fine-tuning without degrading original model performance
  • Theoretical framework linking convergence-rate control to the Hessian spectral structure, enabling tractable analysis of fine-tuning resistance for large foundation models
  • Fundamental impossibility result: the entire class of convergence-rate control defenses can be defeated by an informed adversary at only linear cost in model size, establishing a ceiling for this approach
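The card does not reproduce SpecDef's actual construction, but the underlying mechanism (convergence rate is governed by the conditioning of the effective curvature seen by the optimizer) can be illustrated with a toy reparameterization. In this sketch, all names and the diagonal matrix `S` are hypothetical, not the paper's algorithm: releasing weights as `w = S @ theta` with an ill-conditioned `S` leaves the model's outputs unchanged but badly scales gradients in the released parameter space, so first-order fine-tuning converges more slowly at the same step budget.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "fine-tuning" objective: least squares on a 2-parameter linear model.
A = rng.normal(size=(100, 2))
b = A @ np.array([1.0, -2.0]) + 0.01 * rng.normal(size=100)

def loss(w):
    r = A @ w - b
    return 0.5 * r @ r / len(b)

def grad(w):
    return A.T @ (A @ w - b) / len(b)

# Hypothetical spectral defense (illustration only, not SpecDef itself):
# release the parameterization w = S @ theta with an ill-conditioned
# diagonal S, so gradients in theta-space are badly scaled.
S = np.diag([1.0, 1e-3])

def run(steps=500, lr=0.1, defended=True):
    theta = np.zeros(2)
    for _ in range(steps):
        w = S @ theta if defended else theta
        g = grad(w)
        # Chain rule: dL/dtheta = S^T dL/dw under the reparameterization.
        theta -= lr * (S @ g if defended else g)
    return loss(S @ theta if defended else theta)

undefended = run(defended=False)
defended = run(defended=True)
assert defended > undefended  # same budget, slower convergence when defended
```

The effective update on `w` under the defense is `w -= lr * S @ S @ g`, so directions with small singular values in `S` learn quadratically more slowly; this is the convergence-rate control the contribution above refers to, reduced to its simplest form.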

🛡️ Threat Analysis

Transfer Learning Attack

The core threat is adversarial use of fine-tuning/transfer learning to repurpose open-weight pre-trained models for harmful tasks (e.g., weapons assistance, deepfakes). SpecDef resists this by manipulating the spectral structure of the weights to slow fine-tuning convergence, and the paper establishes fundamental limits for this entire class of fine-tuning-resistance defenses.
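The impossibility result can also be seen in miniature. In this self-contained toy (the diagonal `S` and all names are hypothetical, not the paper's construction), an informed adversary who knows the defensive reparameterization `w = S @ theta` simply introduces new parameters `phi` with `theta = S⁻¹ @ phi`, so the effective weights are `w = phi` and the original, well-conditioned optimization is recovered. The only overhead is one extra linear map, i.e. linear in model size:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(100, 2))
b = A @ np.array([1.0, -2.0])

def grad_w(w):
    """Gradient of the least-squares loss in weight space."""
    return A.T @ (A @ w - b) / len(b)

# Hypothetical defense: released parameterization w = S @ theta with an
# ill-conditioned S (known to the attacker, for illustration only).
S = np.diag([1.0, 1e-3])
S_inv = np.linalg.inv(S)

def fine_tune(attack, steps=500, lr=0.1):
    phi = np.zeros(2)
    for _ in range(steps):
        if attack:
            # Adversary composes the inverse map: w = S @ S_inv @ phi = phi,
            # so gradients in phi match the undefended gradients exactly.
            w = S @ (S_inv @ phi)
            g = S_inv.T @ (S.T @ grad_w(w))
        else:
            # Honest user stuck with the defended parameterization.
            w = S @ phi
            g = S.T @ grad_w(w)
        phi -= lr * g
    w = S @ (S_inv @ phi) if attack else S @ phi
    r = A @ w - b
    return 0.5 * r @ r / len(b)

slow = fine_tune(attack=False)
fast = fine_tune(attack=True)
assert fast < slow  # informed attacker restores fast convergence
```

Because the attack only wraps the released weights in an inverse transform, any defense that works purely by reshaping the optimization geometry is vulnerable in the same way, which is why the paper argues future defenses must not reduce to convergence-rate control.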


Details

Domains
vision, nlp
Model Types
llm, transformer
Threat Tags
white_box, training_time
Applications
open-weight foundation model safety, harmful fine-tuning resistance, NSFW content generation prevention