Latest papers

1 paper
defense arXiv Feb 19, 2026

Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning

Jyotin Goel, Souvik Maji, Pratik Mazumder · Indian Institute of Technology Jodhpur

Defends LLMs against harmful fine-tuning attacks via adaptive KL regularization guided by a safety critic or an activation-based risk predictor

Transfer Learning Attack Prompt Injection nlp