Layer-Wise Perturbations via Sparse Autoencoders for Adversarial Text Generation
Huizhen Shu 1, Xuying Li 1, Qirui Wang 1, Yuji Kosuga 1,2, Mengqiu Tian 3, Zhuo Li 1
Published on arXiv
arXiv:2508.10404
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Adversarial texts generated by SFPF successfully bypass state-of-the-art defense mechanisms in LLMs by perturbing sparse autoencoder features in critical hidden layers.
SFPF (Sparse Feature Perturbation Framework)
Novel technique introduced
With the rapid proliferation of Natural Language Processing (NLP) systems, especially Large Language Models (LLMs), generating adversarial examples to jailbreak LLMs remains a key challenge for understanding model vulnerabilities and improving robustness. In this context, we propose a new black-box attack method that leverages the interpretability of large models. We introduce the Sparse Feature Perturbation Framework (SFPF), a novel approach for adversarial text generation that uses sparse autoencoders (SAEs) to identify and manipulate critical features in text. After using the SAE model to reconstruct hidden-layer representations, we perform feature clustering on the successfully attacked texts to identify features with higher activations. These highly activated features are then perturbed to generate new adversarial texts. This selective perturbation preserves the malicious intent while amplifying safety signals, thereby increasing their potential to evade existing defenses. Our method enables a new red-teaming strategy that balances adversarial effectiveness with safety alignment. Experimental results demonstrate that adversarial texts generated by SFPF can bypass state-of-the-art defense mechanisms, revealing persistent vulnerabilities in current NLP systems. However, the method's effectiveness varies across prompts and layers, and its generalizability to other architectures and larger models remains to be validated.
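The core loop described in the abstract (encode hidden states with an SAE, amplify the most active features, decode back) can be sketched as follows. Everything here is illustrative: the dimensions, the random weights standing in for a trained SAE, and the helper names (`sae_encode`, `perturb_high_activation`) are assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the paper's components: a hidden-layer activation
# h in R^d from an LLM, and an (untrained, random) SAE with an
# overcomplete feature dimension m > d.
d, m = 16, 64
W_enc = rng.normal(size=(d, m)) / np.sqrt(d)
W_dec = rng.normal(size=(m, d)) / np.sqrt(m)

def sae_encode(h):
    """Sparse feature activations via ReLU (a common SAE choice)."""
    return np.maximum(h @ W_enc, 0.0)

def sae_decode(f):
    """Map SAE features back to the hidden space."""
    return f @ W_dec

def perturb_high_activation(h, top_k=4, scale=1.5):
    """Sketch of SFPF's core step: amplify the top-k most active SAE
    features, then decode back to a perturbed hidden representation."""
    f = sae_encode(h)
    idx = np.argsort(f)[-top_k:]   # indices of highest-activation features
    f_pert = f.copy()
    f_pert[idx] *= scale           # selective perturbation of those features
    return sae_decode(f_pert)

h = rng.normal(size=d)
h_adv = perturb_high_activation(h)
```

In the actual framework the perturbed representation would then guide generation of new adversarial text; this sketch only shows the feature-space manipulation itself.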
Key Contributions
- Proposes SFPF, which uses sparse autoencoders to identify high-activation features in LLM hidden layers from successfully attacked prompts via KMeans clustering
- Introduces a red-teaming strategy that perturbs SAE-derived feature representations to generate adversarial texts without requiring token-level gradient optimization
- Demonstrates empirically that SFPF-generated adversarial examples bypass state-of-the-art LLM defense mechanisms across multiple benchmarks
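The clustering step in the first contribution can be illustrated with a minimal KMeans over SAE feature vectors of successfully attacking prompts. The data, cluster count, and the hand-rolled Lloyd's-algorithm implementation below are all assumptions for the sketch (the paper uses KMeans, but its exact setup is not specified here).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: each row is the SAE feature vector of a prompt
# that successfully attacked the model. Features 2 and 5 are made
# strongly active in a majority subset to simulate a dominant cluster.
X = np.abs(rng.normal(size=(20, 8)))   # 20 attack prompts, 8 SAE features
X[:12, [2, 5]] += 5.0                  # dominant cluster with features 2, 5 hot

def kmeans(X, k=2, iters=50, seed=0):
    """Minimal KMeans (Lloyd's algorithm); a library implementation
    such as scikit-learn's would work equally well."""
    r = np.random.default_rng(seed)
    centers = X[r.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers

labels, centers = kmeans(X)
dominant = np.bincount(labels).argmax()          # largest cluster of attacks
top_feats = np.argsort(centers[dominant])[-2:]   # its highest-activation features
```

In SFPF these `top_feats` would be the features selected for perturbation in subsequent adversarial text generation.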
🛡️ Threat Analysis
SFPF operates at the hidden-layer representation level, using SAEs to identify high-activation features and perturbing them to generate adversarial inputs. This is representation-level adversarial input manipulation, going beyond surface-level text substitution.