NeuronTune: Fine-Grained Neuron Modulation for Balanced Safety-Utility Alignment in LLMs
Birong Pan, Mayi Xu, Qiankun Pi, Jianhao Chen, Yuanyuan Zhu, Ming Zhong, Tieyun Qian
Published on arXiv
arXiv:2508.09473
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
NeuronTune achieves superior robustness against jailbreaking attacks while maintaining general task utility, significantly outperforming layer-wise safety alignment baselines across multiple LLMs and benchmarks.
NeuronTune
Novel technique introduced
Ensuring robust safety alignment while preserving utility is critical for the reliable deployment of Large Language Models (LLMs). However, current techniques suffer from intertwined deficiencies: insufficient robustness against malicious attacks, frequent refusal of benign queries, and degradation in generated-text quality and general task performance. The first two reflect deficits in robust safety; the latter two constitute utility impairment. We trace these limitations to the coarse-grained, layer-wise interventions used by existing methods. To resolve this, we propose NeuronTune, a fine-grained framework that dynamically modulates sparse neurons to optimize safety and utility simultaneously. Our approach first identifies safety-critical and utility-preserving neurons across all layers via attribution, then employs meta-learning to adaptively amplify the activations of safety neurons and suppress those of utility neurons. Crucially, NeuronTune supports tunable adjustment of the intervention scope via a neuron-count threshold, enabling flexible adaptation to security-critical or utility-priority scenarios. Extensive experiments show that our method significantly outperforms existing state-of-the-art techniques, achieving superior model safety while maintaining excellent utility.
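The core mechanism described in the abstract (rank neurons by attribution, then amplify safety-neuron activations and suppress utility-neuron activations) can be sketched as follows. This is an illustrative approximation, not the paper's implementation: the toy attribution scores and the fixed gains `alpha`/`beta` stand in for NeuronTune's attack-aware attribution and meta-learned modulation coefficients.

```python
import numpy as np

def top_k_neurons(scores: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k neurons with the highest attribution scores."""
    k = min(k, scores.size)
    return np.argsort(scores)[::-1][:k]

def modulate_activations(acts, safety_idx, utility_idx, alpha=1.5, beta=0.5):
    """Amplify safety-neuron activations by alpha and suppress
    utility-neuron activations by beta. Both gains are hypothetical
    constants standing in for meta-learned coefficients."""
    out = acts.copy()
    out[safety_idx] *= alpha
    out[utility_idx] *= beta
    return out

# Toy per-neuron attribution scores for safety and utility (made up).
safety_scores = np.array([0.9, 0.1, 0.05, 0.8, 0.2])
utility_scores = np.array([0.05, 0.7, 0.85, 0.1, 0.3])

safety_idx = top_k_neurons(safety_scores, k=2)    # neurons 0 and 3
utility_idx = top_k_neurons(utility_scores, k=2)  # neurons 2 and 1

acts = np.ones(5)
print(modulate_activations(acts, safety_idx, utility_idx))
# → [1.5 0.5 0.5 1.5 1. ]
```

In a real model the modulation would be applied to hidden-layer activations during the forward pass; the array here is a stand-in for one layer's activation vector.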
Key Contributions
- Attack-aware attribution method that identifies safety-critical and utility-preserving neurons across all layers of an LLM by analyzing responses under adversarial attack conditions
- Meta-learning (MAML)-driven adaptive modulation of sparse safety/utility neurons, replacing coarse layer-wise interventions with fine-grained neuron-level control
- Tunable neuron-count threshold mechanism enabling flexible trade-off between security-critical and utility-priority deployment scenarios
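The third contribution, the tunable neuron-count threshold, amounts to choosing how many of the top-attributed neurons to intervene on. A minimal sketch of that trade-off, with the name `budget` and the toy scores being assumptions (the paper calls this a neuron-count threshold):

```python
import numpy as np

def intervention_set(attribution: np.ndarray, budget: int) -> np.ndarray:
    """Select the `budget` most highly attributed neurons to intervene on.
    A larger budget widens the intervention (security-critical deployment);
    a smaller one narrows it (utility-priority deployment)."""
    budget = max(0, min(budget, attribution.size))
    return np.argsort(attribution)[::-1][:budget]

attribution = np.array([0.9, 0.1, 0.05, 0.8, 0.2, 0.6])
print(intervention_set(attribution, budget=4))  # security-critical: broad scope
print(intervention_set(attribution, budget=1))  # utility-priority: minimal scope
```

The scope knob is thus a single integer, which is what makes the safety-utility trade-off adjustable at deployment time without retraining.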