SpectralGuard: Detecting Memory Collapse Attacks in State Space Models
Published on arXiv
2603.12414
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
HiSPA attack collapses spectral radius from 0.98 to 0.32, degrading accuracy by 42-53 percentage points; SpectralGuard defense achieves F1=0.961 (AUC=0.989) against non-adaptive attacks with sub-15ms per-token overhead
HiSPA (Hidden State Poisoning Attack) and SpectralGuard
Novel technique introduced
State Space Models (SSMs) such as Mamba achieve linear-time sequence processing through input-dependent recurrence, but this mechanism introduces a critical safety vulnerability. We show that the spectral radius ρ(Ā) of the discretized transition operator governs effective memory horizon: when an adversary drives ρ toward zero through gradient-based Hidden State Poisoning, memory collapses from millions of tokens to mere dozens, silently destroying reasoning capacity without triggering output-level alarms. We prove an Evasion Existence Theorem showing that for any output-only defense, adversarial inputs exist that simultaneously induce spectral collapse and evade detection, then introduce SpectralGuard, a real-time monitor that tracks spectral stability across all model layers. SpectralGuard achieves F1=0.961 against non-adaptive attackers and retains F1=0.842 under the strongest adaptive setting, with sub-15ms per-token latency. Causal interventions and cross-architecture transfer to hybrid SSM-Attention systems confirm that spectral monitoring provides a principled, deployable safety layer for recurrent foundation models.
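The link between spectral radius and memory horizon can be illustrated with a minimal sketch, assuming a diagonal continuous-time SSM with Mamba-style zero-order-hold discretization (Ā = exp(Δ·A)); the specific eigenvalues, step sizes, and decay threshold below are illustrative assumptions, not the paper's parameters:

```python
# Illustrative sketch (not the paper's code): how rho(A_bar) = rho(exp(dt * A))
# controls how many tokens a stored value survives in a diagonal SSM.
import numpy as np

def spectral_radius(A_diag, dt):
    """rho(A_bar) for diagonal A under zero-order-hold discretization."""
    return np.max(np.abs(np.exp(dt * A_diag)))

def memory_horizon(rho, eps=1e-3):
    """Tokens until a stored value decays below eps, i.e. rho**n < eps."""
    return np.log(eps) / np.log(rho)

A_diag = -np.linspace(0.1, 1.0, 16)              # stable negative eigenvalues
rho_benign = spectral_radius(A_diag, dt=0.2)     # small step: rho ~ 0.98
rho_attacked = spectral_radius(A_diag, dt=11.5)  # inflated step: rho ~ 0.32
```

With these toy numbers the horizon drops from hundreds of tokens to single digits, mirroring the collapse the paper describes: an adversary who can inflate the step size Δ shrinks ρ, and the memory horizon scales like log(ε)/log(ρ).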
Key Contributions
- Theoretical framework proving output-only defenses cannot detect spectral collapse attacks (Evasion Existence Theorem)
- HiSPA attack that induces spectral collapse via gradient-based manipulation of SSM step size, degrading accuracy by 42-53 percentage points
- SpectralGuard real-time defense monitoring spectral radius across layers with F1=0.961 non-adaptive, F1=0.842 adaptive, <15ms latency
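The monitoring idea in the third contribution can be sketched as follows. This is a hypothetical reconstruction, not the paper's implementation: the class name, EMA baseline, and collapse threshold are all assumptions made for illustration.

```python
# Hypothetical SpectralGuard-style monitor: flag a token when any layer's
# spectral radius drops far below its running baseline. Threshold and
# EMA decay are illustrative assumptions, not values from the paper.
import numpy as np

class SpectralMonitor:
    def __init__(self, n_layers, collapse_threshold=0.5, ema_decay=0.99):
        self.baseline = np.ones(n_layers)    # running per-layer rho estimate
        self.threshold = collapse_threshold  # alarm if rho < threshold * baseline
        self.decay = ema_decay

    def step(self, rhos):
        """rhos: per-layer spectral radii at the current token.
        Returns True if any layer shows a collapse alarm."""
        rhos = np.asarray(rhos, dtype=float)
        alarm = np.any(rhos < self.threshold * self.baseline)
        # Update the baseline only after checking, so a sudden collapse
        # cannot drag its own reference value down.
        self.baseline = self.decay * self.baseline + (1 - self.decay) * rhos
        return bool(alarm)
```

Because the check is a per-layer comparison against a scalar baseline, its cost is negligible next to the forward pass, consistent with the sub-15ms per-token overhead the paper reports for its (presumably more sophisticated) monitor.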
🛡️ Threat Analysis
HiSPA is a gradient-based adversarial attack that manipulates the input-dependent step-size parameters (Δ_t) to induce spectral collapse, causing misclassification and reasoning failures. Because it optimizes adversarial inputs at inference time to drive the spectral radius toward zero, it falls squarely under input manipulation (OWASP ML01).
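The attack mechanism can be demonstrated on a toy differentiable model. Everything here is an illustrative assumption, not the paper's HiSPA implementation: a single hypothetical projection `w` maps the input to a step size via softplus, and plain gradient descent on the input drives ρ(exp(Δ·A)) toward zero.

```python
# Toy step-size-inflation attack in the spirit of HiSPA (illustrative only).
# The input x controls dt = softplus(w @ x); perturbing x inflates dt and
# collapses rho(exp(dt * A)) for a stable diagonal A.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=8)        # hypothetical input-to-step-size projection
a_max = -0.1                  # slowest (least negative) eigenvalue of A

def rho(x):
    dt = np.log1p(np.exp(w @ x))  # softplus keeps the step size positive
    return np.exp(dt * a_max)     # spectral radius for diagonal A

x = np.zeros(8)               # benign starting input
rho_before = rho(x)           # close to 1: long memory

for _ in range(300):          # gradient descent on rho w.r.t. the input
    z = w @ x
    sig = 1.0 / (1.0 + np.exp(-z))   # d softplus(z) / dz
    grad = rho(x) * a_max * sig * w  # chain rule: d rho / d x
    x = x - grad                     # step against the gradient

rho_after = rho(x)            # driven toward zero: collapsed memory
```

The key property this toy shares with the attack described above is that the perturbation lives entirely in the input: the model weights are untouched, and the output at any single step looks unremarkable, which is exactly why the paper argues output-only defenses are insufficient.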