LatentGuard: Controllable Latent Steering for Robust Refusal of Attacks and Reliable Response Generation
Huizhen Shu, Xuying Li, Zhuo Li
Published on arXiv
arXiv:2509.19839
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
LatentGuard significantly improves safety controllability and response interpretability on Qwen3-8B without compromising utility, with consistent generalization to Mistral-7B confirming cross-architecture effectiveness.
LatentGuard
Novel technique introduced
Achieving robust safety alignment in large language models (LLMs) while preserving their utility remains a fundamental challenge. Existing approaches often struggle to balance comprehensive safety with fine-grained controllability at the representation level. We introduce LatentGuard, a novel three-stage framework that combines behavioral alignment with supervised latent space control for interpretable and precise safety steering. Our approach begins by fine-tuning an LLM on rationalized datasets containing both reasoning-enhanced refusal responses to adversarial prompts and reasoning-enhanced normal responses to benign queries, establishing robust behavioral priors across both safety-critical and utility-preserving scenarios. We then train a structured variational autoencoder (VAE) on intermediate MLP activations, supervised by multi-label annotations including attack types, attack methods, and benign indicators. This supervision enables the VAE to learn disentangled latent representations that capture distinct adversarial characteristics while maintaining semantic interpretability. Through targeted manipulation of learned latent dimensions, LatentGuard achieves selective refusal behavior, effectively blocking harmful requests while preserving helpfulness for legitimate use cases. Experiments on Qwen3-8B demonstrate significant improvements in both safety controllability and response interpretability without compromising utility. Cross-architecture validation on Mistral-7B confirms the generalizability of our latent steering approach, showing consistent effectiveness across different model families. Our results suggest that structured representation-level intervention offers a promising pathway toward building safer yet practical LLM systems.
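The second stage described above, a VAE over intermediate MLP activations with multi-label supervision, can be sketched in miniature. This is an illustrative toy, not the paper's implementation: the dimensions, the linear encoder/decoder, and the three-label split are all assumptions, and a real setup would use learned nonlinear networks trained by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
D_ACT, D_LAT, N_LABELS = 32, 8, 3  # hypothetical activation dim, latent dim, label count

# Hypothetical parameters standing in for trained encoder, decoder, and label heads
W_enc = rng.normal(scale=0.1, size=(D_ACT, 2 * D_LAT))  # outputs [mu | logvar]
W_dec = rng.normal(scale=0.1, size=(D_LAT, D_ACT))
W_cls = rng.normal(scale=0.1, size=(D_LAT, N_LABELS))   # attack-type / method / benign heads

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(h, labels):
    """One supervised-VAE forward pass on a single MLP activation vector h."""
    stats = h @ W_enc
    mu, logvar = stats[:D_LAT], stats[D_LAT:]
    z = mu + np.exp(0.5 * logvar) * rng.normal(size=D_LAT)  # reparameterization trick
    recon = z @ W_dec                                       # reconstruct the activation
    probs = sigmoid(z @ W_cls)                              # multi-label predictions from z
    recon_loss = np.mean((recon - h) ** 2)
    kl = -0.5 * np.mean(1.0 + logvar - mu**2 - np.exp(logvar))
    sup = -np.mean(labels * np.log(probs + 1e-8)
                   + (1 - labels) * np.log(1 - probs + 1e-8))  # supervision term
    return recon, probs, recon_loss + kl + sup

h = rng.normal(size=D_ACT)          # stand-in for an intermediate MLP activation
labels = np.array([1.0, 0.0, 0.0])  # e.g. one adversarial label active, benign off
recon, probs, loss = forward(h, labels)
```

The supervision term is what ties specific latent coordinates to attack-type, attack-method, and benign labels, which is the property the later steering stage relies on.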
Key Contributions
- Three-stage framework combining rationalized SFT (reasoning-enhanced refusal fine-tuning) with a supervised VAE trained on intermediate MLP activations under multi-label adversarial annotations
- Disentangled latent space that separates interpretable safety dimensions (attack type, attack method, benign indicator) from contextual features, enabling fine-grained controllable refusal
- Targeted latent manipulation (Benign-On/Attack-Off and Benign-Off/Attack-On modes) for selective refusal validated on Qwen3-8B and cross-architecture generalization on Mistral-7B
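The two steering modes in the last contribution can be sketched as a latent edit-and-decode step. Everything here is hypothetical scaffolding: the dimension indices, the steering magnitude `alpha`, and the linear encoder/decoder are placeholders for the paper's trained VAE and its supervised latent dimensions.

```python
import numpy as np

rng = np.random.default_rng(1)
D_ACT, D_LAT = 32, 8  # hypothetical activation and latent dimensions

# Hypothetical linear maps standing in for the trained VAE encoder and decoder
W_enc = rng.normal(scale=0.1, size=(D_ACT, D_LAT))
W_dec = np.linalg.pinv(W_enc)  # pseudo-inverse decoder back to activation space

ATTACK_DIMS = [0, 1]  # latent dims assumed to encode attack type / attack method
BENIGN_DIM = 2        # latent dim assumed to encode the benign indicator

def steer(h, mode, alpha=3.0):
    """Overwrite supervised latent dims, then decode back to activation space."""
    z = h @ W_enc
    if mode == "benign_on_attack_off":    # preserve helpfulness for legitimate queries
        z[BENIGN_DIM] = alpha
        for d in ATTACK_DIMS:
            z[d] = -alpha
    elif mode == "benign_off_attack_on":  # force refusal for adversarial prompts
        z[BENIGN_DIM] = -alpha
        for d in ATTACK_DIMS:
            z[d] = alpha
    return z @ W_dec

h = rng.normal(size=D_ACT)                   # stand-in for a captured MLP activation
h_refuse = steer(h, "benign_off_attack_on")  # steered toward refusal
h_allow = steer(h, "benign_on_attack_off")   # steered toward normal response
```

In the full system the steered activation would be written back into the forward pass at the intervention layer; because the edited dimensions are the supervised, disentangled ones, contextual features carried by the remaining dimensions are left largely intact.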