How a Bit Becomes a Story: Semantic Steering via Differentiable Fault Injection
Zafaryab Haider 1, Md Hafizur Rahman 1, Shane Moeykens 1, Vijay Devabhaktuni 2, Prabuddha Chakraborty 1
Published on arXiv
2512.14715
Model Poisoning
OWASP ML Top 10 — ML10
Key Finding
A single flipped bit in VLM weights can differentiably and predictably steer the high-level semantic content of generated captions without degrading grammatical fluency, exposing a novel attack surface in generative LLMs.
BLADE (Bit-level Fault Analysis via Differentiable Estimation)
Novel technique introduced
Hard-to-detect hardware bit flips, from either malicious circuitry or bugs, have already been shown to make transformers vulnerable in non-generative tasks. This work, for the first time, investigates how low-level, bitwise perturbations (fault injection) to the weights of a large language model (LLM) used for image captioning can influence the semantic meaning of its generated descriptions while preserving grammatical structure. While prior fault analysis methods have shown that flipping a few bits can crash classifiers or degrade accuracy, these approaches overlook the semantic and linguistic dimensions of generative systems. In image captioning models, a single flipped bit might subtly alter how visual features map to words, shifting the entire narrative an AI tells about the world. We hypothesize that such semantic drifts are not random but differentiably estimable. That is, the model's own gradients can predict which bits, if perturbed, will most strongly influence meaning while leaving syntax and fluency intact. We design a differentiable fault analysis framework, BLADE (Bit-level Fault Analysis via Differentiable Estimation), that uses gradient-based sensitivity estimation to locate semantically critical bits and then refines their selection through a caption-level semantic-fluency objective. Our goal is not merely to corrupt captions, but to understand how meaning itself is encoded, distributed, and alterable at the bit level, revealing that even imperceptible low-level changes can steer the high-level semantics of generative vision-language models. It also opens pathways for robustness testing, adversarial defense, and explainable AI, by exposing how structured bit-level faults can reshape a model's semantic output.
Key Contributions
- First investigation of how hardware bit-flip fault injection to LLM/VLM weights can semantically steer generated captions while preserving grammatical fluency
- BLADE framework using gradient-based sensitivity estimation to identify which specific weight bits, when flipped, most strongly shift semantic meaning
- Demonstrates that semantic drift from bit-level perturbations is non-random and differentiably estimable via the model's own gradients
🛡️ Threat Analysis
BLADE directly manipulates deployed model weights at the bit level to produce targeted semantic changes in outputs — a form of model weight poisoning/tampering. Unlike classic backdoors with triggers, the alteration is persistent, but the core threat is intentional, targeted modification of model parameters to steer model behavior, which falls squarely under model poisoning.