
The Achilles' Heel of LLMs: How Altering a Handful of Neurons Can Cripple Language Abilities

Zixuan Qin¹, Qingchen Yu²˒³, Kunlin Lyu¹, Zhaoxin Fan²˒³, Yifan Sun¹

2 citations · 70 references · arXiv


Published on arXiv · 2510.10238

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

Disrupting as few as 3 neurons in a 72B-parameter LLM (over 1.1 billion neurons) causes complete model collapse with perplexity increasing by up to 20 orders of magnitude, revealing extreme fragility concentrated in MLP down_proj components of outer layers.
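To make "disrupting a neuron" concrete: in a transformer MLP, each hidden neuron feeds one input column of the down_proj weight matrix, so silencing that neuron amounts to zeroing (or noising) that column. A toy illustration with made-up dimensions, not the paper's code:

```python
# Toy illustration: in an MLP down projection, hidden neuron j contributes
# via column j of the weight matrix W (rows = model dims, cols = hidden
# neurons). "Disrupting" neuron j here means zeroing that column.

def disrupt_down_proj(W, neuron_idx):
    """Return a copy of W with the given neuron's input column zeroed,
    so hidden unit neuron_idx can no longer influence the output."""
    return [[0.0 if j == neuron_idx else w for j, w in enumerate(row)]
            for row in W]

W = [[0.5, -1.2, 0.3],
     [0.1,  0.7, -0.4]]
W_hit = disrupt_down_proj(W, 1)  # neuron 1 silenced in every output dim
```

In a real model the same effect is achieved by editing the corresponding slice of the down_proj parameter tensor in place; the paper's finding is that for a handful of well-chosen indices this single-column edit is enough to trigger collapse.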

PCICN (Perturbation-based Causal Identification of Critical Neurons)

Novel technique introduced


Large Language Models (LLMs) have become foundational tools in natural language processing, powering a wide range of applications and research. Many studies have shown that LLMs share significant similarities with the human brain. Recent neuroscience research has found that a small subset of biological neurons in the human brain are crucial for core cognitive functions, which raises a fundamental question: do LLMs also contain a small subset of critical neurons? In this paper, we investigate this question by proposing a Perturbation-based Causal Identification of Critical Neurons method to systematically locate such critical neurons in LLMs. Our findings reveal three key insights: (1) LLMs contain ultra-sparse critical neuron sets. Disrupting these critical neurons can cause a 72B-parameter model with over 1.1 billion neurons to completely collapse, with perplexity increasing by up to 20 orders of magnitude; (2) These critical neurons are not uniformly distributed, but tend to concentrate in the outer layers, particularly within the MLP down_proj components; (3) Performance degradation exhibits sharp phase transitions, rather than a gradual decline, when these critical neurons are disrupted. Through comprehensive experiments across diverse model architectures and scales, we provide deeper analysis of these phenomena and their implications for LLM robustness and interpretability. These findings can offer guidance for developing more robust model architectures and improving deployment security in safety-critical applications. Our code is available at https://github.com/qqqqqqqzx/The-Achilles-Heel-of-LLMs.


Key Contributions

  • Proposes PCICN (Perturbation-based Causal Identification of Critical Neurons), a two-stage method using noise injection and greedy masking to locate ultra-sparse critical neuron sets in LLMs
  • Demonstrates that corrupting as few as 3 neurons in a 72B-parameter LLM causes complete model collapse with perplexity increases of up to 20 orders of magnitude
  • Characterizes critical neuron properties: concentrated in outer MLP down_proj layers, non-uniformly distributed, and exhibiting sharp phase-transition degradation rather than gradual decline
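The two-stage structure described above can be sketched as a generic search loop. This is a minimal sketch in my own notation, not the paper's code: `loss_fn` abstracts the perturb-and-measure step (inject noise into a candidate set of neurons, then evaluate perplexity on a held-out corpus), stage 1 ranks neurons by their individual causal impact, and stage 2 greedily grows a minimal critical set from the top candidates.

```python
# Hedged sketch of a PCICN-style two-stage search (function and parameter
# names are assumptions, not from the paper's released code).

def pcicn_sketch(neurons, loss_fn, top_k=10, budget=3):
    """neurons: iterable of neuron ids.
    loss_fn(disabled: frozenset) -> scalar loss (e.g. perplexity measured
    with noise injected into those neurons' weights)."""
    base = loss_fn(frozenset())
    # Stage 1: perturbation scoring -- rank each neuron by the loss
    # increase caused by perturbing it alone.
    scores = {n: loss_fn(frozenset([n])) - base for n in neurons}
    candidates = sorted(scores, key=scores.get, reverse=True)[:top_k]
    # Stage 2: greedy masking -- repeatedly add the remaining candidate
    # whose inclusion most increases the loss of the current set.
    mask = set()
    for _ in range(budget):
        best = max((n for n in candidates if n not in mask),
                   key=lambda n: loss_fn(frozenset(mask | {n})))
        mask.add(best)
    return mask
```

At real scale the scoring stage would be batched per layer rather than one forward pass per neuron, but the control flow is the same: score individually, then refine greedily to an ultra-sparse set.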

🛡️ Threat Analysis

Model Poisoning

The paper demonstrates targeted weight manipulation of identified critical neurons that causes catastrophic, irreversible performance degradation in LLMs. While not a classic backdoor/trigger attack, the PCICN method functions as an attack pipeline for surgical model sabotage: an adversary with white-box weight access can use it to identify and corrupt the minimal set of weights needed to destroy the model's functionality. This is the closest OWASP category for targeted model weight corruption, though the fit is imperfect, since ML10 canonically covers backdoor trigger insertion rather than capability destruction.


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
white_box · targeted
Datasets
WikiText-103 · C4 · MMLU-Pro · IFEval · GPQA-Diamond · HumanEval · MATH · MGSM · SimpleQA
Applications
large language models · llm deployment security