defense 2025

BitFlipScope: Scalable Fault Localization and Recovery for Bit-Flip Corruptions in LLMs

Muhammad Zeeshan Karamat, Sadman Saif, Christiana Chamon Garcia

0 citations · 16 references · arXiv


Published on arXiv · 2512.22174

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

BitFlipScope localizes bit-flip faults at block, layer, weight, and bit granularity in LLMs under both reference-available and reference-free settings, enabling performance recovery without fine-tuning

BitFlipScope

Novel technique introduced


Large Language Models (LLMs) deployed in practical and safety-critical settings are increasingly susceptible to bit-flip faults caused by hardware degradation, cosmic radiation, or deliberate fault-injection attacks such as Rowhammer. These faults silently corrupt internal parameters and can lead to unpredictable or dangerous model behavior. Localizing these corruptions is essential: without identifying the affected region, it is impossible to diagnose the source of degradation, apply targeted corrective measures, or restore model functionality without resorting to costly fine-tuning or full retraining. This work introduces BitFlipScope, a scalable, software-based framework for identifying fault-affected regions within transformer architectures under two deployment scenarios. When a clean reference model is available, BitFlipScope performs differential analysis of outputs, hidden states, and internal activations to detect anomalous behavior indicative of corruption and localize the fault. When no reference model exists, it uses residual-path perturbation and loss-sensitivity profiling to infer the fault-impacted region directly from the corrupted model. In both settings, the framework not only enables effective fault diagnosis but also supports lightweight performance recovery without fine-tuning, offering a practical path to restoring corrupted models. Together, these capabilities make BitFlipScope an important step toward trustworthy, fault-resilient LLM deployment in hardware-prone and adversarial environments.
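The paper does not publish code; as a minimal sketch of the reference-available setting, the differential comparison and targeted restoration the abstract describes can be illustrated at weight-and-bit granularity. The sketch below assumes float32 weights and simply XORs the bit patterns of a clean and a corrupted tensor; the function names (`localize_bit_flips`, `repair`) are illustrative, not from the paper.

```python
import numpy as np

def localize_bit_flips(clean, corrupted):
    """Differential localization sketch: compare two float32 weight tensors
    bit-by-bit and return (flat_index, bit_position) for each flipped bit."""
    c = np.ascontiguousarray(clean, dtype=np.float32).ravel()
    d = np.ascontiguousarray(corrupted, dtype=np.float32).ravel()
    xor = c.view(np.uint32) ^ d.view(np.uint32)  # per-weight differing bit pattern
    faults = []
    for idx in np.nonzero(xor)[0]:
        for pos in range(32):
            if (xor[idx] >> pos) & 1:
                faults.append((int(idx), pos))
    return faults

def repair(corrupted, clean, faults):
    """Targeted restoration: copy only the faulted weights from the reference,
    leaving all other parameters untouched (no fine-tuning needed)."""
    fixed = corrupted.copy()
    flat, ref = fixed.ravel(), clean.ravel()
    for idx, _ in faults:
        flat[idx] = ref[idx]
    return fixed

# Demo: negating a float32 value flips exactly the IEEE-754 sign bit (bit 31).
rng = np.random.default_rng(0)
w_clean = rng.standard_normal((4, 4)).astype(np.float32)
w_bad = w_clean.copy()
w_bad.ravel()[5] = -w_bad.ravel()[5]          # simulated single bit flip
faults = localize_bit_flips(w_clean, w_bad)   # → [(5, 31)]
w_fixed = repair(w_bad, w_clean, faults)
```

A real deployment would run this per transformer block to narrow the search from block to layer to weight to bit, as the key finding describes; the XOR trick is just the innermost step once a candidate tensor is isolated.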


Key Contributions

  • Dual-setting fault localization framework that works both with (differential hidden-state analysis) and without (residual-path perturbation + loss-sensitivity profiling) a clean reference model
  • Self-referential fault diagnosis method that infers corrupted regions directly from a single corrupted model without requiring a baseline
  • Lightweight, fine-tuning-free performance recovery via targeted parameter restoration (differential) and scaling-based attenuation (self-referential)
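For the reference-free setting, a toy illustration of residual-path perturbation with loss-sensitivity profiling: attenuate each block's residual branch in turn and see which attenuation most reduces the loss. Everything here (the toy residual network, the `profile_blocks` helper, the injected fault) is an assumed stand-in for the paper's method, not its implementation.

```python
import numpy as np

def forward(x, blocks, scales):
    """Toy residual network: h <- h + s_i * (h @ W_i) for each block i.
    The per-block scale s_i models a perturbable residual path."""
    h = x
    for W, s in zip(blocks, scales):
        h = h + s * (h @ W)
    return h

def profile_blocks(x, target, blocks):
    """Loss-sensitivity profiling sketch: zero out each block's residual
    branch in turn; the block whose removal lowers the loss the most is
    flagged as the likely fault-impacted region."""
    n = len(blocks)
    base = np.mean((forward(x, blocks, [1.0] * n) - target) ** 2)
    gains = []
    for i in range(n):
        scales = [1.0] * n
        scales[i] = 0.0  # attenuate block i's residual path
        loss = np.mean((forward(x, blocks, scales) - target) ** 2)
        gains.append(base - loss)  # positive gain => removing block i helps
    return int(np.argmax(gains)), gains

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 16))
clean = [0.1 * rng.standard_normal((16, 16)) for _ in range(4)]
target = forward(x, clean, [1.0] * 4)         # behavior before corruption

corrupted = [W.copy() for W in clean]
corrupted[2][0, 0] += 50.0                    # simulated bit flip in block 2
suspect, gains = profile_blocks(x, target, corrupted)  # suspect → 2
```

Scaling-based attenuation then follows directly: instead of restoring weights (impossible without a reference), the flagged block's residual contribution is down-scaled to recover much of the lost performance without fine-tuning.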

🛡️ Threat Analysis

Model Poisoning

Defends against deliberate bit-flip attacks (gradient-based weight targeting and Rowhammer hardware exploitation) that corrupt LLM parameters post-deployment. Model Poisoning is the closest OWASP category for direct weight-corruption attacks: although ML10 traditionally covers trigger-based backdoors, the paper addresses the broader class of intentional model-parameter manipulation.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, physical, inference_time, targeted
Datasets
GPT-2
Applications
large language models, safety-critical nlp deployment, hardware-adversarial environments