
SBFA: Single Sneaky Bit Flip Attack to Break Large Language Models

Jingkai Guo, Chaitali Chakrabarti, Deliang Fan

4 citations · 2 influential · 22 references · arXiv


Published on arXiv · 2509.21843

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

A single bit flip out of billions of parameters degrades Qwen, LLaMA, and Gemma LLMs to below random-guess accuracy on MMLU and SST-2 in both BF16 and INT8 formats.

SBFA (Sneaky Bit-Flip Attack)

Novel technique introduced


The model integrity of large language models (LLMs) has become a pressing security concern with their massive online deployment. Prior Bit-Flip Attacks (BFAs) -- a class of popular AI weight memory fault-injection techniques -- can severely compromise deep neural networks (DNNs): as few as tens of bit flips can degrade accuracy toward random guessing. Recent studies extend BFAs to LLMs and reveal that, despite the intuition that modularity and redundancy confer robustness, only a handful of adversarial bit flips can likewise cause catastrophic accuracy degradation in LLMs. However, existing BFA methods typically target either integer or floating-point models separately, limiting attack flexibility. Moreover, in floating-point models, random bit flips often push perturbed parameters to extreme values (e.g., by flipping an exponent bit), making the attack conspicuous and causing numerical runtime errors (e.g., invalid NaN/Inf tensor values). In this work, we propose SBFA (Sneaky Bit-Flip Attack), which, for the first time, collapses LLM performance with a single bit flip while keeping the perturbed value within the benign layer-wise weight distribution. This is achieved by iteratively searching and ranking parameters with our sensitivity metric, ImpactScore, which combines gradient sensitivity with a perturbation range constrained by the benign layer-wise weight distribution. We also propose SKIP, a novel lightweight search algorithm that greatly reduces search complexity, so that a successful SBFA search takes only tens of minutes on SOTA LLMs. Across Qwen, LLaMA, and Gemma models, a single bit flip suffices for SBFA to degrade accuracy to below random levels on MMLU and SST-2 in both BF16 and INT8 data formats. That flipping one bit out of billions of parameters can do this exposes a severe security concern for SOTA LLMs.
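The stealth constraint hinges on which bit is flipped. In BF16 (1 sign bit, 8 exponent bits, 7 mantissa bits), a low mantissa flip nudges a weight only slightly, while flipping a high exponent bit sends it orders of magnitude outside the benign weight distribution (or to NaN/Inf). A minimal sketch, not the paper's code and with helper names of our own, emulating BF16 as the top 16 bits of a float32:

```python
import struct

def bf16_bits(x: float) -> int:
    """BF16 encoding of x: the top 16 bits of its float32 representation."""
    return struct.unpack(">I", struct.pack(">f", x))[0] >> 16

def bf16_value(bits: int) -> float:
    """Decode a 16-bit BF16 pattern back to a Python float."""
    return struct.unpack(">f", struct.pack(">I", bits << 16))[0]

def flip_bit(x: float, pos: int) -> float:
    """Flip bit `pos` of x's BF16 encoding (0 = mantissa LSB, 14 = top exponent bit)."""
    return bf16_value(bf16_bits(x) ^ (1 << pos))

w = 0.0123
small = flip_bit(w, 0)   # mantissa flip: tiny, in-distribution change (stealthy)
huge = flip_bit(w, 14)   # top exponent flip: value explodes by many orders of magnitude
```

This is why a sneaky attack must restrict itself to flips whose resulting value still looks like an ordinary weight for that layer, rather than flipping high exponent bits at random.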


Key Contributions

  • ImpactScore: a parameter sensitivity metric combining gradient sensitivity with perturbation range constrained by benign layer-wise weight distribution to identify the single most critical bit
  • SKIP (Selective sKipping for Impact Prioritization) search algorithm that reduces critical-bit search complexity to tens of minutes on SOTA LLMs
  • First single-bit attack effective across both BF16 and INT8 quantization formats while remaining stealthy (no NaN/Inf values)
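The exact ImpactScore formula is not given in this summary; the sketch below only illustrates the stated idea, assuming a weight scores highly when its gradient is large and it can be moved far in the loss-increasing direction while staying inside a benign layer-wise range (here modeled, hypothetically, as mean ± k·std). Function and parameter names are our own.

```python
import numpy as np

def impact_scores(weights: np.ndarray, grads: np.ndarray, k_sigma: float = 3.0) -> np.ndarray:
    """Hypothetical ImpactScore-style ranking for one layer: combine gradient
    magnitude with the largest loss-increasing perturbation that keeps the
    weight inside the benign layer-wise distribution."""
    mu, sigma = weights.mean(), weights.std()
    lo, hi = mu - k_sigma * sigma, mu + k_sigma * sigma
    # Loss change is approx. grad * delta_w, so perturb upward when grad >= 0,
    # downward otherwise; "room" is how far we can go without leaving [lo, hi].
    room = np.where(grads >= 0, hi - weights, weights - lo)
    room = np.clip(room, 0.0, None)
    return np.abs(grads) * room  # higher = more damaging yet still stealthy

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=1024)   # toy layer weights
g = rng.normal(0.0, 1.0, size=1024)    # toy gradients w.r.t. the loss
scores = impact_scores(w, g)
top = int(np.argmax(scores))           # candidate weight for the single bit flip
```

In the actual attack, a ranking of this kind would be computed per layer and the SKIP algorithm would prune low-impact candidates so that only the most promising bits are ever evaluated.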

🛡️ Threat Analysis

Model Poisoning

SBFA directly corrupts model weights via hardware fault injection (bit flips) to cause catastrophic model degradation. While it does not insert a classic trigger-based backdoor, the attack poisons model parameters post-training to induce targeted malicious behavior (total performance collapse), which falls squarely under model poisoning through direct weight manipulation.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, targeted, physical
Datasets
MMLU, SST-2
Applications
large language model inference, text classification, natural language understanding