From Shallow to Deep: Pinning Semantic Intent via Causal GRPO

Large Language Models remain vulnerable to adversarial prefix attacks (e.g., "Sure, here is") even when they appear robust under standard safety evaluation. We diagnose this vulnerability as Shallow Safety Alignment, stemming from a pathology we term semantic representation decay: as the model generates compliant prefixes, its internal representation of the request's malicious intent fades. To address this, we propose Two-Stage Causal-GRPO (TSC-GRPO), a framework designed to achieve intent pinning. First, grounded in causal identifiability theory, we train a causal intent probe to disentangle invariant intent from stylistic perturbations. Second, we internalize this causal awareness into the policy via Group Relative Policy Optimization (GRPO). By applying a cumulative causal penalty within "fork-in-the-road" training scenarios, we force the model to learn that accumulating harmful tokens monotonically decreases reward, enabling robust late-stage refusals. Experiments show that TSC-GRPO significantly outperforms baselines in defending against jailbreak attacks while preserving general utility.

Key Contributions

Identifies 'Semantic Representation Decay' as the mechanistic root cause of shallow safety alignment failure under adversarial prefix attacks
Proposes a causal intent probe (Stage 1) using hard-negative augmentation to disentangle invariant malicious intent from stylistic compliance signals
Introduces a cumulative causal penalty within GRPO (Stage 2) via 'fork-in-the-road' training to enforce robust late-stage refusals even after compliant prefixes
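
To make the Stage 2 idea concrete, here is a minimal sketch of how a cumulative penalty could interact with GRPO's group-relative advantage. It is an illustration under stated assumptions, not the paper's implementation: the function names, the `weight` parameter, the clipping of probe scores at zero, and the example scores are all hypothetical, and the probe is stood in for by hand-written per-token numbers. The only standard piece is the group-relative normalization (reward minus group mean, divided by group standard deviation); the population standard deviation used here is one common choice.

```python
from statistics import mean, pstdev


def cumulative_penalty(probe_scores, weight=1.0):
    """Running penalty after each token (hypothetical form).

    Scores are clipped at zero, so the running total never decreases:
    every additional harmful token can only lower the final reward.
    """
    total = 0.0
    running = []
    for score in probe_scores:
        total += weight * max(score, 0.0)
        running.append(total)
    return running


def penalized_reward(base_reward, probe_scores, weight=1.0):
    """Base reward minus the final cumulative penalty (assumed reward shape)."""
    running = cumulative_penalty(probe_scores, weight)
    return base_reward - (running[-1] if running else 0.0)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two hypothetical completions sampled after the same compliant prefix
# (a "fork in the road"): one refuses late, one keeps complying.
late_refusal_scores = [0.9, 0.1, 0.0]       # intent signal drops once the model refuses
harmful_scores = [0.9, 0.8, 0.9, 0.7]       # intent signal stays high throughout

rewards = [
    penalized_reward(1.0, late_refusal_scores),
    penalized_reward(1.0, harmful_scores),
]
advantages = group_relative_advantages(rewards)
```

In this toy group the late refusal receives the higher advantage, which is the direction the abstract describes: continuing down the harmful branch accumulates penalty, so the policy update favors refusing even after a compliant prefix.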

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

inference_time, training_time, black_box

Applications

llm safety alignment, chatbot safety, jailbreak defense

2025 0 cit.
