
FlipLLM: Efficient Bit-Flip Attacks on Multimodal LLMs using Reinforcement Learning

Khurram Khalil, Khaza Anuarul Hoque

0 citations · 30 references · arXiv


Published on arXiv · 2512.09872

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

Flipping 5–7 RL-identified bits collapses LLaMA 3.1 8B accuracy from 69.9% to ~0.2% and LLaVA VQA score from 78% to near 0%, 2.5x faster than prior state-of-the-art methods.

FlipLLM

Novel technique introduced


Generative Artificial Intelligence models, such as Large Language Models (LLMs) and Vision-Language Models (VLMs), exhibit state-of-the-art performance but remain vulnerable to hardware-based threats, specifically bit-flip attacks (BFAs). Existing BFA discovery methods lack generalizability and struggle to scale, often failing to analyze the vast parameter space and complex interdependencies of modern foundation models in a reasonable time. This paper proposes FlipLLM, an architecture-agnostic reinforcement learning (RL) framework that formulates BFA discovery as a sequential decision-making problem. FlipLLM combines sensitivity-guided layer pruning with Q-learning to efficiently identify minimal, high-impact bit sets that can induce catastrophic failure. We demonstrate the effectiveness and generalizability of FlipLLM by applying it to a diverse set of models, including prominent text-only LLMs (GPT-2 Large, LLaMA 3.1 8B, and DeepSeek-V2 7B) and VLMs such as LLaVA 1.6, across datasets such as MMLU, MMLU-Pro, VQAv2, and TextVQA. Our results show that FlipLLM identifies critical bits vulnerable to BFAs up to 2.5x faster than SOTA methods. Flipping the FlipLLM-identified bits collapses the accuracy of LLaMA 3.1 8B from 69.9% to ~0.2% and LLaVA's VQA score from 78% to almost 0%, with as few as 5 and 7 bit flips, respectively. Further analysis reveals that applying standard hardware protection mechanisms, such as ECC SECDED, to the FlipLLM-identified bit locations completely mitigates the BFA impact, demonstrating the practical value of our framework in guiding hardware-level defenses. FlipLLM offers the first scalable and adaptive methodology for exploring the BFA vulnerability of both language and multimodal foundation models, paving the way for comprehensive hardware-security evaluation.
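The ECC SECDED mitigation mentioned in the abstract (single-error correction, double-error detection) can be illustrated with a toy Hamming(8,4) code. This is only a conceptual stand-in, not the paper's evaluation setup; real DRAM ECC typically protects 64-bit words with a (72,64) code:

```python
def secded_encode(nibble):
    """Encode a 4-bit value as a Hamming(7,4) codeword plus an overall parity bit."""
    d = [(nibble >> i) & 1 for i in range(4)]  # data bits d0..d3
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    # Index 0 holds the overall parity; indices 1..7 are standard Hamming positions.
    code = [0, p1, p2, d[0], p3, d[1], d[2], d[3]]
    overall = 0
    for b in code[1:]:
        overall ^= b
    code[0] = overall
    return code

def secded_decode(code):
    """Return (decoded nibble, status): correct 1 flipped bit, detect 2."""
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i  # XOR of set-bit positions locates a single error
    overall = 0
    for b in code:
        overall ^= b       # even parity over all 8 bits when clean
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:     # odd parity: exactly one bit flipped
        status = "corrected"
        if syndrome:
            code = code[:]
            code[syndrome] ^= 1
    else:                  # even parity but nonzero syndrome: two flips
        status = "double-error detected"
    d = [code[3], code[5], code[6], code[7]]
    return sum(b << i for i, b in enumerate(d)), status

cw = secded_encode(0b1011)
cw_flipped = cw[:]
cw_flipped[5] ^= 1  # simulate a single bit flip in memory
print(secded_decode(cw))          # clean read
print(secded_decode(cw_flipped))  # single flip is corrected transparently
```

Under such a code, corrupting a protected word requires at least two simultaneous flips in the same word, and even that is detected rather than silently accepted, which is why targeting ECC at the identified bit locations neutralizes the attack.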


Key Contributions

  • FlipLLM: an architecture-agnostic RL framework combining Q-learning with sensitivity-guided layer pruning to efficiently discover minimal high-impact bit-flip sets in LLMs and VLMs
  • Achieves 2.5x speedup over SOTA BFA discovery methods, collapsing LLaMA 3.1 8B from 69.9% to ~0.2% and LLaVA VQA score from 78% to near 0% with as few as 5–7 bit flips
  • First scalable BFA methodology for multimodal foundation models, with analysis demonstrating ECC SECDED hardware protection fully mitigates the identified critical bit locations
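The paper's actual reward design and pruning heuristics are not reproduced here, but the core idea of framing bit selection as Q-learning can be sketched with a toy surrogate. The candidate set, reward values, and "critical" bits below are all hypothetical stand-ins for a real model evaluation:

```python
import random

random.seed(0)

N_BITS = 16            # hypothetical candidates surviving sensitivity-guided pruning
BUDGET = 3             # bit-flip budget per episode
CRITICAL = {2, 7, 11}  # toy ground truth: bits whose flip collapses accuracy

def marginal_drop(bit):
    """Toy surrogate for the accuracy drop caused by flipping one bit.
    In FlipLLM this would come from evaluating the faulted model."""
    return 0.30 if bit in CRITICAL else 0.01

Q = [0.0] * N_BITS
alpha, eps = 0.1, 0.3  # learning rate and epsilon-greedy exploration

for _ in range(2000):
    flipped = set()
    for _ in range(BUDGET):  # sequential decision: pick one bit at a time
        candidates = [b for b in range(N_BITS) if b not in flipped]
        if random.random() < eps:
            a = random.choice(candidates)
        else:
            a = max(candidates, key=lambda b: Q[b])
        # Q-update toward the observed accuracy drop for this flip
        Q[a] += alpha * (marginal_drop(a) - Q[a])
        flipped.add(a)

best = sorted(range(N_BITS), key=lambda b: -Q[b])[:BUDGET]
print(sorted(best))  # converges to the high-impact bits {2, 7, 11}
```

The sketch collapses the problem to a per-bit bandit; the full framework additionally conditions on model state and prunes entire layers before search, which is where the reported 2.5x speedup comes from.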

🛡️ Threat Analysis

Model Poisoning

Bit-flip attacks physically corrupt model weight parameters in hardware memory, manipulating the model's internal state to cause catastrophic functional failure. This is a direct attack on model parameters analogous to model poisoning, though the vector is hardware-level rather than retraining. ML10 is therefore the closest OWASP category for weight-corruption attacks that cause targeted model malfunction.
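Why a handful of flips can suffice comes down to IEEE-754 encoding: a single flip in a weight's exponent field changes its magnitude by many orders of magnitude, while a low mantissa flip is negligible. A minimal illustration (the weight value is arbitrary, not taken from the paper):

```python
import struct

def flip_bit(x, i):
    """Flip bit i (0 = LSB, 31 = sign) of a float32 value's IEEE-754 encoding."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    (y,) = struct.unpack("<f", struct.pack("<I", bits ^ (1 << i)))
    return y

w = 0.04  # an illustrative small weight
print(flip_bit(w, 3))   # low mantissa bit: change on the order of 1e-8
print(flip_bit(w, 30))  # top exponent bit: magnitude explodes past 1e30
```

A single exponent-bit flip like the second call saturates downstream activations, which is the mechanism that lets 5-7 well-chosen flips zero out benchmark accuracy.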


Details

Domains
nlp, multimodal
Model Types
llm, vlm, transformer
Threat Tags
grey_box, inference_time, physical, targeted
Datasets
MMLU, MMLU-Pro, VQAv2, TextVQA
Applications
large language models, vision-language models, text classification, visual question answering