LatentGuard: Controllable Latent Steering for Robust Refusal of Attacks and Reliable Response Generation
Huizhen Shu, Xuying Li, Zhuo Li
Published on arXiv
arXiv:2509.19839
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
LatentGuard significantly improves safety controllability and response interpretability on Qwen3-8B without compromising utility, with consistent generalization to Mistral-7B confirming cross-architecture effectiveness.
LatentGuard
Novel technique introduced
Achieving robust safety alignment in large language models (LLMs) while preserving their utility remains a fundamental challenge. Existing approaches often struggle to balance comprehensive safety with fine-grained controllability at the representation level. We introduce LatentGuard, a novel three-stage framework that combines behavioral alignment with supervised latent space control for interpretable and precise safety steering. Our approach begins by fine-tuning an LLM on rationalized datasets containing both reasoning-enhanced refusal responses to adversarial prompts and reasoning-enhanced normal responses to benign queries, establishing robust behavioral priors across both safety-critical and utility-preserving scenarios. We then train a structured variational autoencoder (VAE) on intermediate MLP activations, supervised by multi-label annotations including attack types, attack methods, and benign indicators. This supervision enables the VAE to learn disentangled latent representations that capture distinct adversarial characteristics while maintaining semantic interpretability. Through targeted manipulation of learned latent dimensions, LatentGuard achieves selective refusal behavior, effectively blocking harmful requests while preserving helpfulness for legitimate use cases. Experiments on Qwen3-8B demonstrate significant improvements in both safety controllability and response interpretability without compromising utility. Cross-architecture validation on Mistral-7B confirms the generalizability of our latent steering approach, showing consistent effectiveness across different model families. Our results suggest that structured representation-level intervention offers a promising pathway toward building safer yet practical LLM systems.
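The second stage described above, a VAE over intermediate MLP activations with multi-label supervision, can be sketched in miniature. This is an illustrative toy, not the paper's implementation: the dimensions, the linear encoder/decoder, and the three-label split are all assumptions, and a real setup would use learned nonlinear networks trained by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
D_ACT, D_LAT, N_LABELS = 32, 8, 3  # hypothetical activation dim, latent dim, label count

# Hypothetical parameters standing in for trained encoder, decoder, and label heads
W_enc = rng.normal(scale=0.1, size=(D_ACT, 2 * D_LAT))  # outputs [mu | logvar]
W_dec = rng.normal(scale=0.1, size=(D_LAT, D_ACT))
W_cls = rng.normal(scale=0.1, size=(D_LAT, N_LABELS))   # attack-type / method / benign heads

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(h, labels):
    """One supervised-VAE forward pass on a single MLP activation vector h."""
    stats = h @ W_enc
    mu, logvar = stats[:D_LAT], stats[D_LAT:]
    z = mu + np.exp(0.5 * logvar) * rng.normal(size=D_LAT)  # reparameterization trick
    recon = z @ W_dec                                       # reconstruct the activation
    probs = sigmoid(z @ W_cls)                              # multi-label predictions from z
    recon_loss = np.mean((recon - h) ** 2)
    kl = -0.5 * np.mean(1.0 + logvar - mu**2 - np.exp(logvar))
    sup = -np.mean(labels * np.log(probs + 1e-8)
                   + (1 - labels) * np.log(1 - probs + 1e-8))  # supervision term
    return recon, probs, recon_loss + kl + sup

h = rng.normal(size=D_ACT)          # stand-in for an intermediate MLP activation
labels = np.array([1.0, 0.0, 0.0])  # e.g. one adversarial label active, benign off
recon, probs, loss = forward(h, labels)
```

The supervision term is what ties specific latent coordinates to attack-type, attack-method, and benign labels, which is the property the later steering stage relies on.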
Key Contributions
- Three-stage framework combining rationalized SFT (reasoning-enhanced refusal fine-tuning) with a supervised VAE trained on intermediate MLP activations under multi-label adversarial annotations
- Disentangled latent space that separates interpretable safety dimensions (attack type, attack method, benign indicator) from contextual features, enabling fine-grained controllable refusal
- Targeted latent manipulation (Benign-On/Attack-Off and Benign-Off/Attack-On modes) for selective refusal validated on Qwen3-8B and cross-architecture generalization on Mistral-7B
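The two steering modes in the last contribution can be sketched as a latent edit-and-decode step. Everything here is hypothetical scaffolding: the dimension indices, the steering magnitude `alpha`, and the linear encoder/decoder are placeholders for the paper's trained VAE and its supervised latent dimensions.

```python
import numpy as np

rng = np.random.default_rng(1)
D_ACT, D_LAT = 32, 8  # hypothetical activation and latent dimensions

# Hypothetical linear maps standing in for the trained VAE encoder and decoder
W_enc = rng.normal(scale=0.1, size=(D_ACT, D_LAT))
W_dec = np.linalg.pinv(W_enc)  # pseudo-inverse decoder back to activation space

ATTACK_DIMS = [0, 1]  # latent dims assumed to encode attack type / attack method
BENIGN_DIM = 2        # latent dim assumed to encode the benign indicator

def steer(h, mode, alpha=3.0):
    """Overwrite supervised latent dims, then decode back to activation space."""
    z = h @ W_enc
    if mode == "benign_on_attack_off":    # preserve helpfulness for legitimate queries
        z[BENIGN_DIM] = alpha
        for d in ATTACK_DIMS:
            z[d] = -alpha
    elif mode == "benign_off_attack_on":  # force refusal for adversarial prompts
        z[BENIGN_DIM] = -alpha
        for d in ATTACK_DIMS:
            z[d] = alpha
    return z @ W_dec

h = rng.normal(size=D_ACT)                   # stand-in for a captured MLP activation
h_refuse = steer(h, "benign_off_attack_on")  # steered toward refusal
h_allow = steer(h, "benign_on_attack_off")   # steered toward normal response
```

In the full system the steered activation would be written back into the forward pass at the intervention layer; because the edited dimensions are the supervised, disentangled ones, contextual features carried by the remaining dimensions are left largely intact.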