RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs
Tuan T. Nguyen¹, John Le², Thai T. Vu², Willy Susilo², Heath Cooper²
Published on arXiv (2510.13901)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
RAID achieves higher attack success rates with fewer queries and lower computational cost than recent white-box and black-box jailbreak baselines across multiple open-source LLMs.
RAID (Refusal-Aware and Integrated Decoding)
Novel technique introduced
Large language models (LLMs) achieve impressive performance across diverse tasks yet remain vulnerable to jailbreak attacks that bypass safety mechanisms. We present RAID (Refusal-Aware and Integrated Decoding), a framework that systematically probes these weaknesses by crafting adversarial suffixes that induce restricted content while preserving fluency. RAID relaxes discrete tokens into continuous embeddings and optimizes them with a joint objective that (i) encourages restricted responses, (ii) incorporates a refusal-aware regularizer to steer activations away from refusal directions in embedding space, and (iii) applies a coherence term to maintain semantic plausibility and non-redundancy. After optimization, a critic-guided decoding procedure maps embeddings back to tokens by balancing embedding affinity with language-model likelihood. This integration yields suffixes that are both effective in bypassing defenses and natural in form. Experiments on multiple open-source LLMs show that RAID achieves higher attack success rates with fewer queries and lower computational cost than recent white-box and black-box baselines. These findings highlight the importance of embedding-space regularization for understanding and mitigating LLM jailbreak vulnerabilities.
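The joint objective described above combines three terms: an attack term encouraging the restricted response, a refusal-aware regularizer penalizing alignment with refusal directions, and a coherence term. The following is a minimal sketch of how such a loss could be composed; the function name, the precomputed `refusal_dir` vector, and the scalar log-probability inputs are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def raid_joint_loss(suffix_emb, refusal_dir, target_logp, coherence_logp,
                    lam_refusal=1.0, lam_coherence=0.1):
    """Sketch of a RAID-style joint objective over a continuous suffix.

    suffix_emb:     (T, d) relaxed suffix embeddings (the optimized variables)
    refusal_dir:    (d,) hypothetical unit vector for the model's refusal
                    direction in embedding space (assumed precomputed)
    target_logp:    scalar log-probability of the restricted target response
    coherence_logp: scalar LM log-likelihood of the suffix itself
    """
    # (i) encourage the restricted response: maximize its log-probability
    attack_term = -target_logp
    # (ii) refusal-aware regularizer: penalize alignment with the refusal direction
    refusal_term = np.mean((suffix_emb @ refusal_dir) ** 2)
    # (iii) coherence term: keep the suffix fluent under the base LM
    coherence_term = -coherence_logp
    return attack_term + lam_refusal * refusal_term + lam_coherence * coherence_term
```

Projecting the suffix embeddings orthogonal to the refusal direction zeroes out term (ii), so such a suffix scores strictly lower under this loss, all else equal.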
Key Contributions
- Refusal-aware regularizer that steers activations away from refusal directions in embedding space during adversarial suffix optimization
- Joint objective combining restricted-response encouragement, refusal-direction steering, and coherence regularization for fluent adversarial suffixes
- Critic-guided decoding procedure that maps optimized continuous embeddings back to natural-sounding discrete tokens
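The critic-guided decoding step in the last contribution can be sketched as a per-position scoring rule that trades off embedding affinity against LM likelihood. The linear interpolation via `alpha` and the cosine-similarity affinity are assumptions for illustration; the paper's actual critic may score candidates differently.

```python
import numpy as np

def critic_guided_decode(suffix_emb, vocab_emb, lm_logprobs, alpha=0.5):
    """Sketch of critic-guided decoding: for each suffix position, pick the
    vocabulary token that balances embedding affinity with LM likelihood.

    suffix_emb:  (T, d) optimized continuous suffix embeddings
    vocab_emb:   (V, d) token embedding matrix
    lm_logprobs: (T, V) per-position LM log-probabilities (hypothetical input)
    alpha:       affinity/fluency trade-off (assumed linear combination)
    """
    # embedding affinity: cosine similarity of each position with every token
    s = suffix_emb / np.linalg.norm(suffix_emb, axis=1, keepdims=True)
    v = vocab_emb / np.linalg.norm(vocab_emb, axis=1, keepdims=True)
    affinity = s @ v.T                          # (T, V)
    score = alpha * affinity + (1 - alpha) * lm_logprobs
    return score.argmax(axis=1)                 # one token id per position
```

With a high `alpha`, decoding favors the token whose embedding lies closest to the optimized vector; lowering `alpha` lets the LM prior pull the suffix toward more natural-sounding tokens.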
🛡️ Threat Analysis
RAID optimizes adversarial suffixes by relaxing discrete tokens into continuous embeddings and applying gradient-based optimization. This places it in the same class as GCG: gradient-based adversarial suffix optimization, which qualifies as an input manipulation attack executed at inference time.
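The continuous-relaxation optimization above can be illustrated with a toy gradient-descent loop. This sketch minimizes only the refusal-alignment penalty; the attack and coherence terms require gradients through the target model and are omitted. All names and the hand-derived gradient are illustrative assumptions.

```python
import numpy as np

def optimize_suffix(emb0, refusal_dir, steps=50, lr=0.1):
    """Toy gradient descent on relaxed suffix embeddings, minimizing
    mean((emb @ refusal_dir)^2), i.e. alignment with the refusal direction.
    """
    emb = emb0.copy()
    n = emb.shape[0]
    for _ in range(steps):
        # gradient of mean((e_t . r)^2) w.r.t. emb: (2/T) * outer(emb @ r, r)
        align = emb @ refusal_dir
        grad = (2.0 / n) * np.outer(align, refusal_dir)
        emb -= lr * grad
    return emb
```

Each step shrinks the component of every suffix position along the refusal direction while leaving the orthogonal components untouched, mirroring the "steer activations away from refusal directions" idea in embedding space.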