TrapSuffix: Proactive Defense Against Adversarial Suffixes in Jailbreaking
Mengyao Du, Han Fang, Haokai Ma, Gang Yang, Quanjun Yin, Shouling Ji, Ee-chien Chang
Published on arXiv
2602.06630
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
TrapSuffix reduces average attack success rate to below 0.01% and achieves 87.9% tracing success rate against suffix-based jailbreaks, using only 15.87 MB of additional memory versus ~10^4 MB for LLM-based detection defenses.
TrapSuffix
Novel technique introduced
Suffix-based jailbreak attacks append an adversarial suffix, i.e., a short token sequence, to steer aligned LLMs into unsafe outputs. Because suffixes are free-form text, they admit essentially unlimited surface forms, making jailbreak mitigation difficult. Most existing defenses rely on passive detection of suspicious suffixes, without leveraging the defender's inherent asymmetric ability to inject secrets and proactively plant traps. Motivated by this, we take a controllability-oriented perspective and develop a proactive defense that forces attackers into a no-win dilemma: either they fall into defender-designed optimization traps and fail to produce an effective adversarial suffix, or they succeed only by generating suffixes that carry distinctive, traceable fingerprints. We propose TrapSuffix, a lightweight fine-tuning approach that injects trap-aligned behaviors into the base model without changing the inference pipeline. By reshaping the model's response landscape to adversarial suffixes, TrapSuffix channels jailbreak attempts into these two outcomes. Across diverse suffix-based jailbreak settings, TrapSuffix reduces the average attack success rate to below 0.01% and achieves an average tracing success rate of 87.9%, providing both strong defense and reliable traceability. It introduces no inference-time overhead and negligible memory cost, requiring only 15.87 MB of additional memory on average, whereas state-of-the-art LLM-based detection defenses typically incur memory overheads on the order of 10^4 MB. TrapSuffix also composes naturally with existing filtering-based defenses for complementary protection.
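The tracing half of the dilemma can be illustrated with a toy sketch: if the reshaped response landscape forces successful suffixes to contain defender-chosen marker tokens, attribution reduces to scanning a suffix for those markers. The marker sets, deployment names, and scoring rule below are illustrative assumptions for exposition, not the paper's actual fingerprinting method.

```python
# Hypothetical tracing sketch: each protected deployment plants its own
# marker tokens; a suffix that succeeded against a given deployment is
# assumed to carry some of that deployment's markers. All names here are
# illustrative, not from the TrapSuffix paper.
TRAP_MARKERS = {
    "deployA": {"zeta", "quorum"},
    "deployB": {"umbra", "glyph"},
}

def trace(suffix_tokens):
    """Return the deployment whose planted markers best overlap the suffix,
    or None if no marker appears at all."""
    scores = {name: len(markers & set(suffix_tokens))
              for name, markers in TRAP_MARKERS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Under this (assumed) scheme, `trace(["please", "zeta", "now"])` attributes the suffix to `deployA`, while a marker-free suffix yields `None`; the paper's 87.9% tracing success rate would correspond to how often such attribution identifies the right source.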
Key Contributions
- TrapSuffix: a lightweight fine-tuning approach that injects trap-aligned behaviors into the base model, forcing attackers into a no-win dilemma (optimization traps or traceable fingerprints)
- Reduces average attack success rate to below 0.01% across diverse suffix-based jailbreak settings with no inference-time overhead and only ~16 MB additional memory
- Achieves 87.9% average tracing success rate, enabling attribution of adversarial suffixes that evade the trap
🛡️ Threat Analysis
Defends against adversarial suffix optimization attacks such as GCG and AutoDAN — optimization-based discrete token-level perturbations (gradient-guided search in GCG, genetic search in AutoDAN) that elicit unsafe outputs at inference time. This is the canonical ML01 adversarial example scenario applied to LLMs.
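The attack pattern being defended against can be sketched in miniature. Real GCG-style attacks rank candidate tokens by model gradients over embeddings; the sketch below replaces the model with a toy loss and does an exhaustive greedy coordinate sweep, purely to show the discrete per-position optimization loop that TrapSuffix's reshaped loss landscape is designed to trap. The vocabulary, embeddings, and loss are all toy assumptions.

```python
import numpy as np

# Toy greedy coordinate search over a discrete suffix, the optimization
# pattern behind GCG-style suffix attacks. The "model" is replaced by a
# toy loss: distance between the mean suffix embedding and a stand-in
# "unsafe" target direction. Everything here is illustrative.
rng = np.random.default_rng(0)
VOCAB, DIM, SUFFIX_LEN = 50, 8, 5
emb = rng.normal(size=(VOCAB, DIM))    # toy token embedding table
target = rng.normal(size=DIM)          # stand-in for the attack objective

def loss(suffix):
    # lower = suffix steers the (toy) model closer to the unsafe target
    return float(np.linalg.norm(emb[suffix].mean(axis=0) - target))

suffix = list(rng.integers(0, VOCAB, SUFFIX_LEN))
initial_loss = loss(suffix)
for _ in range(10):                    # a few greedy sweeps
    for pos in range(SUFFIX_LEN):      # per position, try every vocab token
        candidates = [suffix[:pos] + [t] + suffix[pos + 1:]
                      for t in range(VOCAB)]
        suffix = min(candidates, key=loss)
final_loss = loss(suffix)
```

Each sweep can only keep or lower the loss, since the current suffix is always among the candidates. A trap-style defense, in this framing, would shape the loss surface so this greedy descent converges to suffixes that are either ineffective or fingerprinted.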