From Shallow to Deep: Pinning Semantic Intent via Causal GRPO
Shuyi Zhou 1,2, Zeen Song 1,3, Wenwen Qiang 3, Jiyan Sun 2, Yao Zhou 1,3, Yinlong Liu 2, Wei Ma 2
Published on arXiv: 2603.02675
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
TSC-GRPO significantly outperforms baseline alignment methods in defending against adversarial prefix jailbreak attacks while preserving general model utility.
TSC-GRPO (Two-Stage Causal-GRPO)
Novel technique introduced
Large Language Models remain vulnerable to adversarial prefix attacks (e.g., "Sure, here is") despite robust standard safety. We diagnose this vulnerability as Shallow Safety Alignment, stemming from a pathology we term semantic representation decay: as the model generates compliant prefixes, its internal malicious intent signal fades. To address this, we propose Two-Stage Causal-GRPO (TSC-GRPO), a framework designed to achieve intent pinning. First, grounded in causal identifiability theory, we train a causal intent probe to disentangle invariant intent from stylistic perturbations. Second, we internalize this causal awareness into the policy via Group Relative Policy Optimization. By employing a cumulative causal penalty within "fork-in-the-road" training scenarios, we force the model to learn that accumulating harmful tokens monotonically decreases reward, enabling robust late-stage refusals. Experiments show that TSC-GRPO significantly outperforms baselines in defending against jailbreak attacks while preserving general utility.
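The claim that accumulating harmful tokens monotonically decreases reward can be illustrated with a small sketch. This is not the paper's implementation: the function name `cumulative_penalty_rewards`, the weight `lam`, the `base_reward`, and the assumption that the intent probe emits a per-token score in [0, 1] are all hypothetical choices made for illustration.

```python
from itertools import accumulate


def cumulative_penalty_rewards(probe_scores, base_reward=1.0, lam=0.5):
    """Return the reward after each generated token.

    probe_scores: hypothetical per-token harmful-intent scores in [0, 1],
    standing in for the output of a causal intent probe.
    The running sum of the scores, scaled by lam, is subtracted from
    base_reward, so the reward cannot increase as harmful tokens accumulate.
    """
    if any(s < 0 or s > 1 for s in probe_scores):
        raise ValueError("probe scores are assumed to lie in [0, 1]")
    return [base_reward - lam * total for total in accumulate(probe_scores)]


# A completion that keeps producing harmful tokens after the prefix ...
harmful = cumulative_penalty_rewards([0.9, 0.8, 0.9, 0.7])
# ... versus one that refuses after a compliant-looking first token.
late_refusal = cumulative_penalty_rewards([0.9, 0.1, 0.0, 0.0])

print(harmful)
print(late_refusal)
```

Under these assumptions both trajectories pay the same price for the compliant prefix, but only the one that keeps generating harmful tokens continues to lose reward, so a late refusal ends with the higher final reward.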
Key Contributions
- Identifies 'Semantic Representation Decay' as the mechanistic root cause of shallow safety alignment failure under adversarial prefix attacks
- Proposes a causal intent probe (Stage 1) using hard-negative augmentation to disentangle invariant malicious intent from stylistic compliance signals
- Introduces a cumulative causal penalty within GRPO (Stage 2) via 'fork-in-the-road' training to enforce robust late-stage refusals even after compliant prefixes
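The Stage 2 contribution relies on Group Relative Policy Optimization, which scores each sampled completion relative to the other completions in its group. The sketch below shows only that normalization step, assuming the commonly used form (reward minus group mean, divided by group standard deviation). The function name, the `eps` term, the use of the population standard deviation, and the example rewards are illustrative assumptions, not values from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled completions.

    Each advantage is (reward - group mean) / (group std + eps).
    eps is a hypothetical guard against a zero standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical "fork-in-the-road" group: four completions sampled after the
# same compliant prefix. The first two continue the harmful answer and carry
# a large cumulative penalty; the last two refuse.
group_rewards = [-0.65, -0.30, 0.50, 0.55]
advantages = group_relative_advantages(group_rewards)

for reward, advantage in zip(group_rewards, advantages):
    print(f"reward={reward:+.2f}  advantage={advantage:+.3f}")
```

In this toy group the refusing completions receive positive advantages and the continuing ones negative advantages, which is the direction of update the fork-in-the-road setup is meant to produce.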