SpectralGuard: Detecting Memory Collapse Attacks in State Space Models
Published on arXiv
2603.12414
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
HiSPA attack collapses spectral radius from 0.98 to 0.32, degrading accuracy by 42-53 percentage points; SpectralGuard defense achieves F1=0.961 (AUC=0.989) against non-adaptive attacks with sub-15ms per-token overhead
HiSPA (Hidden State Poisoning Attack) and SpectralGuard
Novel technique introduced
State Space Models (SSMs) such as Mamba achieve linear-time sequence processing through input-dependent recurrence, but this mechanism introduces a critical safety vulnerability. We show that the spectral radius ρ(Ā) of the discretized transition operator governs effective memory horizon: when an adversary drives ρ toward zero through gradient-based Hidden State Poisoning, memory collapses from millions of tokens to mere dozens, silently destroying reasoning capacity without triggering output-level alarms. We prove an Evasion Existence Theorem showing that for any output-only defense, adversarial inputs exist that simultaneously induce spectral collapse and evade detection, then introduce SpectralGuard, a real-time monitor that tracks spectral stability across all model layers. SpectralGuard achieves F1=0.961 against non-adaptive attackers and retains F1=0.842 under the strongest adaptive setting, with sub-15ms per-token latency. Causal interventions and cross-architecture transfer to hybrid SSM-Attention systems confirm that spectral monitoring provides a principled, deployable safety layer for recurrent foundation models.
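The link between spectral radius and memory horizon can be illustrated with a minimal sketch, assuming a diagonal continuous-time SSM with Mamba-style zero-order-hold discretization (Ā = exp(Δ·A)); the specific eigenvalues, step sizes, and decay threshold below are illustrative assumptions, not the paper's parameters:

```python
# Illustrative sketch (not the paper's code): how rho(A_bar) = rho(exp(dt * A))
# controls how many tokens a stored value survives in a diagonal SSM.
import numpy as np

def spectral_radius(A_diag, dt):
    """rho(A_bar) for diagonal A under zero-order-hold discretization."""
    return np.max(np.abs(np.exp(dt * A_diag)))

def memory_horizon(rho, eps=1e-3):
    """Tokens until a stored value decays below eps, i.e. rho**n < eps."""
    return np.log(eps) / np.log(rho)

A_diag = -np.linspace(0.1, 1.0, 16)              # stable negative eigenvalues
rho_benign = spectral_radius(A_diag, dt=0.2)     # small step: rho ~ 0.98
rho_attacked = spectral_radius(A_diag, dt=11.5)  # inflated step: rho ~ 0.32
```

With these toy numbers the horizon drops from hundreds of tokens to single digits, mirroring the collapse the paper describes: an adversary who can inflate the step size Δ shrinks ρ, and the memory horizon scales like log(ε)/log(ρ).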
Key Contributions
- Theoretical framework proving output-only defenses cannot detect spectral collapse attacks (Evasion Existence Theorem)
- HiSPA attack that induces spectral collapse via gradient-based manipulation of SSM step size, degrading accuracy by 42-53 percentage points
- SpectralGuard real-time defense monitoring spectral radius across layers with F1=0.961 non-adaptive, F1=0.842 adaptive, <15ms latency
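The monitoring idea in the third contribution can be sketched as follows. This is a hypothetical reconstruction, not the paper's implementation: the class name, EMA baseline, and collapse threshold are all assumptions made for illustration.

```python
# Hypothetical SpectralGuard-style monitor: flag a token when any layer's
# spectral radius drops far below its running baseline. Threshold and
# EMA decay are illustrative assumptions, not values from the paper.
import numpy as np

class SpectralMonitor:
    def __init__(self, n_layers, collapse_threshold=0.5, ema_decay=0.99):
        self.baseline = np.ones(n_layers)    # running per-layer rho estimate
        self.threshold = collapse_threshold  # alarm if rho < threshold * baseline
        self.decay = ema_decay

    def step(self, rhos):
        """rhos: per-layer spectral radii at the current token.
        Returns True if any layer shows a collapse alarm."""
        rhos = np.asarray(rhos, dtype=float)
        alarm = np.any(rhos < self.threshold * self.baseline)
        # Update the baseline only after checking, so a sudden collapse
        # cannot drag its own reference value down.
        self.baseline = self.decay * self.baseline + (1 - self.decay) * rhos
        return bool(alarm)
```

Because the check is a per-layer comparison against a scalar baseline, its cost is negligible next to the forward pass, consistent with the sub-15ms per-token overhead the paper reports for its (presumably more sophisticated) monitor.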
🛡️ Threat Analysis
HiSPA is a gradient-based adversarial attack that manipulates the input-dependent step-size parameters (Δ_t) to induce spectral collapse, causing misclassification and reasoning failures. Because it optimizes adversarial inputs at inference time to drive the spectral radius toward zero, it falls squarely under input manipulation (OWASP ML01).
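The attack mechanism can be demonstrated on a toy differentiable model. Everything here is an illustrative assumption, not the paper's HiSPA implementation: a single hypothetical projection `w` maps the input to a step size via softplus, and plain gradient descent on the input drives ρ(exp(Δ·A)) toward zero.

```python
# Toy step-size-inflation attack in the spirit of HiSPA (illustrative only).
# The input x controls dt = softplus(w @ x); perturbing x inflates dt and
# collapses rho(exp(dt * A)) for a stable diagonal A.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=8)        # hypothetical input-to-step-size projection
a_max = -0.1                  # slowest (least negative) eigenvalue of A

def rho(x):
    dt = np.log1p(np.exp(w @ x))  # softplus keeps the step size positive
    return np.exp(dt * a_max)     # spectral radius for diagonal A

x = np.zeros(8)               # benign starting input
rho_before = rho(x)           # close to 1: long memory

for _ in range(300):          # gradient descent on rho w.r.t. the input
    z = w @ x
    sig = 1.0 / (1.0 + np.exp(-z))   # d softplus(z) / dz
    grad = rho(x) * a_max * sig * w  # chain rule: d rho / d x
    x = x - grad                     # step against the gradient

rho_after = rho(x)            # driven toward zero: collapsed memory
```

The key property this toy shares with the attack described above is that the perturbation lives entirely in the input: the model weights are untouched, and the output at any single step looks unremarkable, which is exactly why the paper argues output-only defenses are insufficient.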