Attack · 2025

How a Bit Becomes a Story: Semantic Steering via Differentiable Fault Injection

Zafaryab Haider ¹, Md Hafizur Rahman ¹, Shane Moeykens ¹, Vijay Devabhaktuni ², Prabuddha Chakraborty ¹

¹ University of Maine

² Illinois State University

0 citations · 49 references · arXiv

Published on arXiv: 2512.14715

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

A single flipped bit in VLM weights can steer the high-level semantic content of generated captions, in a direction the model's own gradients can predict, without degrading grammatical fluency. This exposes a new attack surface in generative vision-language models.
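Why can one bit matter so much? The following minimal sketch assumes float32 weights. The summary does not state the numeric format of the attacked model, and fp16/bf16 use different field widths. The sketch uses Python's standard `struct` module to invert a single bit of a weight's IEEE-754 encoding. The helper `flip_bit` is a hypothetical name chosen for illustration and is not part of BLADE.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Invert one bit of the float32 encoding of `value`.

    Bit 0 is the mantissa LSB, bits 23-30 are the exponent, bit 31 is the sign.
    """
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


w = 0.5  # exactly representable in float32
for bit, field in [(0, "mantissa LSB"), (22, "mantissa MSB"),
                   (23, "exponent LSB"), (30, "exponent MSB"), (31, "sign")]:
    print(f"bit {bit:2d} ({field}): {w} -> {flip_bit(w, bit)!r}")
```

For the weight 0.5, flipping the lowest mantissa bit changes it by about 6e-8. Flipping the most significant exponent bit turns it into 2^127 (about 1.7e38). That spread is why the choice of bit matters as much as the choice of weight.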

BLADE (Bit-level Fault Analysis via Differentiable Estimation)

Novel technique introduced

Hard-to-detect hardware bit flips, from either malicious circuitry or bugs, have already been shown to make transformers vulnerable in non-generative tasks. This work, for the first time, investigates how low-level, bitwise perturbations (fault injection) to the weights of a large language model (LLM) used for image captioning can influence the semantic meaning of its generated descriptions while preserving grammatical structure. While prior fault analysis methods have shown that flipping a few bits can crash classifiers or degrade accuracy, these approaches overlook the semantic and linguistic dimensions of generative systems. In image captioning models, a single flipped bit might subtly alter how visual features map to words, shifting the entire narrative an AI tells about the world. We hypothesize that such semantic drifts are not random but differentiably estimable. That is, the model's own gradients can predict which bits, if perturbed, will most strongly influence meaning while leaving syntax and fluency intact. We design a differentiable fault analysis framework, BLADE (Bit-level Fault Analysis via Differentiable Estimation), that uses gradient-based sensitivity estimation to locate semantically critical bits and then refines their selection through a caption-level semantic-fluency objective. Our goal is not merely to corrupt captions, but to understand how meaning itself is encoded, distributed, and alterable at the bit level, revealing that even imperceptible low-level changes can steer the high-level semantics of generative vision-language models. It also opens pathways for robustness testing, adversarial defense, and explainable AI, by exposing how structured bit-level faults can reshape a model's semantic output.
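The abstract describes BLADE's first stage only as gradient-based sensitivity estimation. A common way to do this is a first-order Taylor estimate, used here as an assumed stand-in rather than the authors' exact rule. Flipping bit b of weight w_i changes it to w_i'. The loss change is then approximated as g_i * (w_i' - w_i), where g_i = dL/dw_i comes from a single backward pass. The sketch ranks every (weight, bit) pair by that estimate. The names `rank_bit_flips`, `to_f32` and `flip_bit` are hypothetical. So is the steering loss the sketch assumes, which is lower when the caption is closer to the intended meaning.

```python
import math
import struct


def to_f32(value: float) -> float:
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def flip_bit(value: float, bit: int) -> float:
    """Invert one bit (0 = mantissa LSB, 31 = sign) of the float32 encoding of `value`."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


def rank_bit_flips(weights, grads, bits=range(32), top_k=5):
    """Rank (weight index, bit) pairs by a first-order estimate of the loss change.

    For weight w_i with gradient g_i = dL/dw_i, flipping bit b gives w_i' and
    delta_L ~= g_i * (w_i' - w_i). Flips that yield inf/NaN are skipped because
    the linear estimate says nothing useful about them.
    """
    scored = []
    for i, (w, g) in enumerate(zip(weights, grads)):
        w = to_f32(w)  # work on the value actually stored in float32
        for b in bits:
            w_new = flip_bit(w, b)
            if not math.isfinite(w_new):
                continue
            scored.append((g * (w_new - w), i, b, w_new))
    # Most negative first: largest predicted decrease of the (hypothetical) steering loss.
    scored.sort(key=lambda item: item[0])
    return scored[:top_k]


# Toy example: two weights and their gradients.
for delta, i, b, w_new in rank_bit_flips([0.5, -0.25], [0.1, -2.0], top_k=3):
    print(f"weight {i}, bit {b:2d}: new value {w_new:.3g}, predicted dL {delta:+.3f}")
```

Exponent flips produce very large predicted changes, so a purely first-order ranking can favor them. A caption-level check of the kind the abstract mentions is one way to discard flips that would also wreck fluency.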

Key Contributions

First investigation of how hardware bit-flip fault injection into LLM/VLM weights can semantically steer generated captions while preserving grammatical fluency
BLADE framework using gradient-based sensitivity estimation to identify which specific weight bits, when flipped, most strongly shift semantic meaning
Demonstrates that semantic drift from bit-level perturbations is non-random and differentiably estimable via the model's own gradients
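According to the abstract, BLADE refines candidate bits through a caption-level semantic-fluency objective, but the summary does not define that objective. The following hypothetical stand-in assumes a targeted setting, since the threat tags list `targeted`. Each candidate flip is scored by how much closer its caption's embedding moves to a target embedding. A penalty, weighted by `lam`, is then subtracted for any rise in log-perplexity. The names `CaptionEval`, `semantic_fluency_score` and `select_best_flip` are assumptions, as are the cosine-similarity measure and the value of `lam`. In practice the embeddings and token log-probabilities would come from a sentence encoder and a language model run on the flipped model's captions.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class CaptionEval:
    """Hypothetical per-caption measurements gathered after decoding."""
    embedding: list[float]       # sentence embedding of the caption (encoder is an assumption)
    token_logprobs: list[float]  # log p(token) under a reference language model


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mean_nll(logprobs: list[float]) -> float:
    """Average negative log-likelihood per token, i.e. log-perplexity."""
    return -sum(logprobs) / len(logprobs)


def semantic_fluency_score(cand: CaptionEval, clean: CaptionEval,
                           target: list[float], lam: float = 0.5) -> float:
    """Reward movement toward `target`; penalise only increases in log-perplexity."""
    semantic_gain = cosine(cand.embedding, target) - cosine(clean.embedding, target)
    fluency_penalty = max(0.0, mean_nll(cand.token_logprobs) - mean_nll(clean.token_logprobs))
    return semantic_gain - lam * fluency_penalty


def select_best_flip(candidates: dict, clean: CaptionEval,
                     target: list[float], lam: float = 0.5):
    """Return the (weight index, bit) key whose caption scores highest."""
    return max(candidates, key=lambda k: semantic_fluency_score(candidates[k], clean, target, lam))


clean = CaptionEval([1.0, 0.0], [-1.0, -1.0])
target = [0.0, 1.0]
candidates = {
    (0, 31): CaptionEval([0.0, 1.0], [-3.0, -3.0]),  # on target but much less fluent
    (1, 29): CaptionEval([1.0, 1.0], [-1.0, -1.0]),  # partly on target, equally fluent
}
print(select_best_flip(candidates, clean, target))  # -> (1, 29)
```

Because only increases in log-perplexity are penalized, flips that keep or improve fluency compete purely on semantic movement. This mirrors the stated goal of shifting meaning while leaving syntax intact.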

🛡️ Threat Analysis

Model Poisoning

BLADE directly manipulates deployed model weights at the bit level to produce targeted semantic changes in outputs, a form of model weight poisoning or tampering. Unlike classic backdoors, it needs no input trigger: once a bit is flipped, the alteration persists and remains in effect for every subsequent inference. The core threat is still the intentional, targeted modification of model parameters to steer model behavior, which falls squarely under model poisoning.

Details

Domains

vision · nlp · multimodal · generative

Model Types

vlm · llm · multimodal

Threat Tags

white_box · inference_time · targeted · digital

Applications

image captioning · generative vision-language models

Similar Papers

MTAttack: Multi-Target Backdoor Attacks against Large Vision-Language Models

TokenSwap: Backdoor Attack on the Compositional Understanding of Large Vision-Language Models

Concept-Guided Backdoor Attack on Vision Language Models

IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding

Semantic-level Backdoor Attack against Text-to-Image Diffusion Models

Hidden Ads: Behavior Triggered Semantic Backdoors for Advertisement Injection in Vision Language Models

Multimodal Prompt Decoupling Attack on the Safety Filters in Text-to-Image Models

SlowBA: An efficiency backdoor attack towards VLM-based GUI agents