defense 2026

NeST: Neuron Selective Tuning for LLM Safety

Sasha Behrouzi, Lichao Wu, Mohamadreza Rostami, Ahmad-Reza Sadeghi

0 citations · 68 references · arXiv (Cornell University)


Published on arXiv: 2602.16835

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

NeST reduces average attack success rate from 44.5% to 4.36% across 10 LLMs using only 0.44M trainable parameters — 17,310x fewer than full fine-tuning — while surpassing all baselines in safety performance.

NeST (Neuron Selective Tuning)

Novel technique introduced


Safety alignment is essential for the responsible deployment of large language models (LLMs). Yet existing approaches often rely on heavyweight fine-tuning that is costly to update, audit, and maintain across model families. Full fine-tuning incurs substantial computational and storage overhead, while parameter-efficient methods such as LoRA trade efficiency for inconsistent safety gains and sensitivity to design choices. Safety intervention mechanisms such as circuit breakers reduce unsafe outputs without modifying model weights, but do not directly shape or preserve the internal representations that govern safety behavior. These limitations hinder rapid and reliable safety updates, particularly in settings where models evolve frequently or must adapt to new policies and domains. We present NeST, a lightweight, structure-aware safety alignment framework that strengthens refusal behavior by selectively adapting a small subset of safety-relevant neurons while freezing the remainder of the model. NeST aligns parameter updates with the internal organization of safety behavior by clustering functionally coherent safety neurons and enforcing shared updates within each cluster, enabling targeted and stable safety adaptation without broad model modification or inference-time overhead. We benchmark NeST against three dominant baselines (full fine-tuning, LoRA-based fine-tuning, and circuit breakers) across 10 open-weight LLMs spanning multiple model families and sizes. Across all evaluated models, NeST reduces the attack success rate from an average of 44.5% to 4.36%, corresponding to a 90.2% reduction in unsafe generations, while requiring only 0.44 million trainable parameters on average. This amounts to a 17,310x decrease in updated parameters compared to full fine-tuning and a 9.25x reduction relative to LoRA, while consistently achieving stronger safety alignment performance.


Key Contributions

  • NeST framework that selectively adapts a small subset of safety-relevant neurons (~0.44M parameters) while freezing the rest of the model, achieving targeted safety alignment without broad weight modification.
  • Neuron clustering mechanism that groups functionally coherent safety neurons and enforces shared updates within each cluster for stable, structured safety adaptation.
  • Benchmark across 10 open-weight LLMs showing 90.2% reduction in unsafe generations (ASR from 44.5% to 4.36%), outperforming full fine-tuning, LoRA, and circuit breakers while using 17,310x fewer updated parameters.
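The core mechanism described above (freeze the model, select a small set of safety-relevant neurons, and train one shared update vector per functional cluster) can be sketched in a few lines. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the neuron indices and cluster assignments below are hypothetical placeholders, whereas NeST derives them from its own safety-neuron identification and clustering procedure.

```python
import numpy as np

def apply_nest_update(W, safety_rows, clusters, deltas):
    """Apply a shared per-cluster update to safety-relevant neuron rows.

    W           : (n_neurons, d) layer weight matrix, frozen except safety rows
    safety_rows : indices of the (hypothetical) safety-relevant neurons
    clusters    : cluster id for each entry of safety_rows
    deltas      : (n_clusters, d) trainable update vectors, one per cluster
    """
    W = W.copy()
    for row, c in zip(safety_rows, clusters):
        # all neurons in a cluster share the same update vector
        W[row] += deltas[c]
    return W

# toy layer: 8 neurons with hidden size 4
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))
safety_rows = [1, 3, 6]           # hypothetical safety-relevant neurons
clusters = [0, 0, 1]              # two functionally coherent clusters
deltas = np.zeros((2, 4))
deltas[0] = 0.1                   # trainable in practice; set by hand here

W_new = apply_nest_update(W, safety_rows, clusters, deltas)
```

The parameter-count intuition carries over directly: only `deltas` is trainable (here 2×4 = 8 values versus 8×4 = 32 for full fine-tuning of the layer), which is how NeST reaches roughly 0.44M trainable parameters instead of billions.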

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, black_box
Datasets
AdvBench
Applications
llm safety alignment, open-weight language models, chatbot safety