Guoli Wang

Papers in Database (1)

defense · arXiv · Mar 8, 2026

Few Tokens, Big Leverage: Preserving Safety Alignment by Constraining Safety Tokens during Fine-tuning

Guoli Wang, Haonan Shi, Tu Ouyang et al. · Case Western Reserve University

Preserves LLM safety alignment during fine-tuning by regularizing model confidence on only a small subset of safety-critical tokens

Transfer Learning Attack · Prompt Injection · nlp