defense 2025

BitFlipScope: Scalable Fault Localization and Recovery for Bit-Flip Corruptions in LLMs

Muhammad Zeeshan Karamat, Sadman Saif, Christiana Chamon Garcia

0 citations · 16 references · arXiv


Published on arXiv · 2512.22174

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

BitFlipScope localizes bit-flip faults at block, layer, weight, and bit granularity in LLMs under both reference-available and reference-free settings, enabling performance recovery without fine-tuning

BitFlipScope

Novel technique introduced


Large Language Models (LLMs) deployed in practical and safety-critical settings are increasingly susceptible to bit-flip faults caused by hardware degradation, cosmic radiation, or deliberate fault-injection attacks such as Rowhammer. These faults silently corrupt internal parameters and can lead to unpredictable or dangerous model behavior. Localizing these corruptions is essential: without identifying the affected region, it is impossible to diagnose the source of degradation, apply targeted corrective measures, or restore model functionality without resorting to costly fine-tuning or full retraining. This work introduces BitFlipScope, a scalable, software-based framework for identifying fault-affected regions within transformer architectures under two deployment scenarios. When a clean reference model is available, BitFlipScope performs differential analysis of outputs, hidden states, and internal activations to detect anomalous behavior indicative of corruption and localize the fault. When no reference model exists, it uses residual-path perturbation and loss-sensitivity profiling to infer the fault-impacted region directly from the corrupted model. In both settings, the framework not only enables effective fault diagnosis but also supports lightweight performance recovery without fine-tuning, offering a practical path to restoring corrupted models. Together, these capabilities make BitFlipScope an important step toward trustworthy, fault-resilient LLM deployment in hardware-prone and adversarial environments.
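The paper does not publish code; as a minimal sketch of the reference-available setting, the differential comparison and targeted restoration the abstract describes can be illustrated at weight-and-bit granularity. The sketch below assumes float32 weights and simply XORs the bit patterns of a clean and a corrupted tensor; the function names (`localize_bit_flips`, `repair`) are illustrative, not from the paper.

```python
import numpy as np

def localize_bit_flips(clean, corrupted):
    """Differential localization sketch: compare two float32 weight tensors
    bit-by-bit and return (flat_index, bit_position) for each flipped bit."""
    c = np.ascontiguousarray(clean, dtype=np.float32).ravel()
    d = np.ascontiguousarray(corrupted, dtype=np.float32).ravel()
    xor = c.view(np.uint32) ^ d.view(np.uint32)  # per-weight differing bit pattern
    faults = []
    for idx in np.nonzero(xor)[0]:
        for pos in range(32):
            if (xor[idx] >> pos) & 1:
                faults.append((int(idx), pos))
    return faults

def repair(corrupted, clean, faults):
    """Targeted restoration: copy only the faulted weights from the reference,
    leaving all other parameters untouched (no fine-tuning needed)."""
    fixed = corrupted.copy()
    flat, ref = fixed.ravel(), clean.ravel()
    for idx, _ in faults:
        flat[idx] = ref[idx]
    return fixed

# Demo: negating a float32 value flips exactly the IEEE-754 sign bit (bit 31).
rng = np.random.default_rng(0)
w_clean = rng.standard_normal((4, 4)).astype(np.float32)
w_bad = w_clean.copy()
w_bad.ravel()[5] = -w_bad.ravel()[5]          # simulated single bit flip
faults = localize_bit_flips(w_clean, w_bad)   # → [(5, 31)]
w_fixed = repair(w_bad, w_clean, faults)
```

A real deployment would run this per transformer block to narrow the search from block to layer to weight to bit, as the key finding describes; the XOR trick is just the innermost step once a candidate tensor is isolated.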


Key Contributions

  • Dual-setting fault localization framework that works both with (differential hidden-state analysis) and without (residual-path perturbation + loss-sensitivity profiling) a clean reference model
  • Self-referential fault diagnosis method that infers corrupted regions directly from a single corrupted model without requiring a baseline
  • Lightweight, fine-tuning-free performance recovery via targeted parameter restoration (differential) and scaling-based attenuation (self-referential)
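For the reference-free setting, a toy illustration of residual-path perturbation with loss-sensitivity profiling: attenuate each block's residual branch in turn and see which attenuation most reduces the loss. Everything here (the toy residual network, the `profile_blocks` helper, the injected fault) is an assumed stand-in for the paper's method, not its implementation.

```python
import numpy as np

def forward(x, blocks, scales):
    """Toy residual network: h <- h + s_i * (h @ W_i) for each block i.
    The per-block scale s_i models a perturbable residual path."""
    h = x
    for W, s in zip(blocks, scales):
        h = h + s * (h @ W)
    return h

def profile_blocks(x, target, blocks):
    """Loss-sensitivity profiling sketch: zero out each block's residual
    branch in turn; the block whose removal lowers the loss the most is
    flagged as the likely fault-impacted region."""
    n = len(blocks)
    base = np.mean((forward(x, blocks, [1.0] * n) - target) ** 2)
    gains = []
    for i in range(n):
        scales = [1.0] * n
        scales[i] = 0.0  # attenuate block i's residual path
        loss = np.mean((forward(x, blocks, scales) - target) ** 2)
        gains.append(base - loss)  # positive gain => removing block i helps
    return int(np.argmax(gains)), gains

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 16))
clean = [0.1 * rng.standard_normal((16, 16)) for _ in range(4)]
target = forward(x, clean, [1.0] * 4)         # behavior before corruption

corrupted = [W.copy() for W in clean]
corrupted[2][0, 0] += 50.0                    # simulated bit flip in block 2
suspect, gains = profile_blocks(x, target, corrupted)  # suspect → 2
```

Scaling-based attenuation then follows directly: instead of restoring weights (impossible without a reference), the flagged block's residual contribution is down-scaled to recover much of the lost performance without fine-tuning.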

🛡️ Threat Analysis

Model Poisoning

Defends against deliberate bit-flip attacks (gradient-based weight targeting and Rowhammer hardware exploitation) that corrupt LLM parameters post-deployment. Model Poisoning is the closest OWASP category for direct weight-corruption attacks: although ML10 traditionally covers trigger-based backdoors, the paper addresses the broader class of intentional model-parameter manipulation.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, physical, inference_time, targeted
Datasets
GPT-2
Applications
large language models, safety-critical nlp deployment, hardware-adversarial environments