Hidden State Poisoning Attacks against Mamba-based Language Models
Alexandre Le Mercier, Chris Develder, Thomas Demeester
Published on arXiv (2601.01972)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
A 52B hybrid Jamba SSM-Transformer model collapses on RoBench25 under optimized HiSPA triggers while pure Transformers remain unaffected, revealing a critical architectural vulnerability inherent to SSM hidden-state mechanics.
HiSPA (Hidden State Poisoning Attack)
Novel technique introduced
State space models (SSMs) like Mamba offer efficient alternatives to Transformer-based language models, with linear time complexity. Yet, their adversarial robustness remains critically unexplored. This paper studies the phenomenon whereby specific short input phrases induce a partial amnesia effect in such models by irreversibly overwriting information in their hidden states, referred to as a Hidden State Poisoning Attack (HiSPA). Our benchmark RoBench25 enables evaluating a model's information retrieval capabilities when subject to HiSPAs, and confirms the vulnerability of SSMs to such attacks. Even a recent 52B hybrid SSM-Transformer model from the Jamba family collapses on RoBench25 under optimized HiSPA triggers, whereas pure Transformers do not. We also observe that HiSPA triggers significantly weaken the Jamba model on the popular Open-Prompt-Injections benchmark, unlike pure Transformers. Finally, our interpretability study reveals patterns in Mamba's hidden layers during HiSPAs that could be used to build a HiSPA mitigation system. The full code and data to reproduce the experiments can be found at https://anonymous.4open.science/r/hispa_anonymous-5DB0.
Key Contributions
- Introduces HiSPA (Hidden State Poisoning Attack), a novel attack using optimized short phrases to irreversibly overwrite Mamba SSM hidden states, inducing partial amnesia about prior context
- Proposes RoBench25, a benchmark for evaluating model information retrieval under HiSPA attacks, empirically confirming SSM architectural vulnerability versus Transformer resilience
- Demonstrates that HiSPA triggers amplify prompt injection susceptibility in 52B hybrid Jamba SSM-Transformer models and provides interpretability analysis of hidden layer patterns to guide mitigation
🛡️ Threat Analysis
HiSPA crafts optimized short trigger phrases that manipulate model hidden states at inference time — these are adversarial inputs (analogous to adversarial suffix optimization) that disrupt correct model behavior through strategic input construction.
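To build intuition for why an input-dependent recurrence can be poisoned this way, the sketch below runs a toy one-dimensional "selective" state update, h_t = a(x_t) * h_{t-1} + x_t. The gate function, token values, and the `TRIGGER` input are invented for illustration and are not the paper's actual Mamba parameterization or its optimized HiSPA phrases; the point is only that an input which drives the retention gate a(x_t) toward zero multiplicatively erases the accumulated state, so prior context cannot be recovered downstream.

```python
import math

# Toy 1-D selective recurrence, for intuition only. The gate shape and
# the trigger value below are illustrative assumptions, not the paper's
# actual Mamba architecture or optimized HiSPA triggers.

def gate(x):
    """Input-dependent retention a(x_t) in (0, 1): ~1 for ordinary tokens."""
    return 1.0 / (1.0 + math.exp(-(x + 5.0)))

def scan(tokens):
    """Run h_t = a(x_t) * h_{t-1} + x_t and return the final hidden state."""
    h = 0.0
    for x in tokens:
        h = gate(x) * h + x
    return h

TRIGGER = -10.0  # hypothetical poisoning input: gate(TRIGGER) is ~0.007

short_ctx, long_ctx = [1.0] * 5, [1.0] * 50

# Without the trigger, the final state clearly reflects the prior context.
print(scan(short_ctx), scan(long_ctx))

# With the trigger appended, the prior state is scaled by ~0.007 before the
# trigger's own value is written in: both runs collapse to nearly the same
# state, a toy analogue of the "partial amnesia" effect.
print(scan(short_ctx + [TRIGGER]), scan(long_ctx + [TRIGGER]))
```

Because the erasure is multiplicative inside the recurrence, no later benign token can restore the overwritten information, which is what distinguishes this from an ordinary prompt-injection that merely competes for attention.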