
Bidirectional Intention Inference Enhances LLMs' Defense Against Multi-Turn Jailbreak Attacks

Haibo Tong 1,2, Dongcheng Zhao 3,4,5,1,2, Guobin Shen 1,2, Xiang He 1,2, Dachuan Lin 5, Feifei Zhao 3,4,5,1,2, Yi Zeng 3,4,5,1,2

Published on arXiv · arXiv:2509.22732 · 1 citation · 39 references

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

BIID significantly reduces Attack Success Rate (ASR) across both single-turn and multi-turn jailbreak attempts, outperforming all seven baseline defense methods while maintaining model utility.

BIID (Bidirectional Intention Inference Defense)

Novel technique introduced


The remarkable capabilities of Large Language Models (LLMs) have raised significant safety concerns, particularly regarding "jailbreak" attacks that exploit adversarial prompts to bypass safety alignment mechanisms. Existing defense research primarily focuses on single-turn attacks, whereas multi-turn jailbreak attacks progressively break through safeguards by concealing malicious intent and applying tactical manipulation, ultimately rendering conventional single-turn defenses ineffective. To address this critical challenge, we propose the Bidirectional Intention Inference Defense (BIID). The method integrates forward request-based intention inference with backward response-based intention retrospection, establishing a bidirectional synergy mechanism that detects risks concealed within seemingly benign inputs and thereby constructs a more robust guardrail against harmful content generation. The proposed method is systematically evaluated against a no-defense baseline and seven representative defense methods across three LLMs and two safety benchmarks under 10 different attack methods. Experimental results demonstrate that BIID significantly reduces the Attack Success Rate (ASR) across both single-turn and multi-turn jailbreak attempts, outperforming all baseline methods while effectively maintaining practical utility. Notably, comparative experiments across three multi-turn safety datasets further validate the proposed method's significant advantages over other defense approaches.


Key Contributions

  • Proposes BIID, integrating forward request-based intention inference with backward response-based intention retrospection to detect malicious intent hidden across multi-turn conversations
  • Demonstrates that existing single-turn defenses degrade significantly under multi-turn jailbreak attacks, establishing a systematic benchmark of this gap
  • Outperforms seven representative defense baselines across three LLMs, ten attack methods, and five safety datasets on both single-turn and multi-turn attack scenarios
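The bidirectional structure described above can be sketched as a two-stage guard around generation: a forward pass that infers intent from the accumulated user requests, and a backward pass that retrospects on what a draft response would actually reveal. The sketch below is hypothetical and is not the paper's implementation; the two keyword-stub classifiers stand in for the LLM-based intent judges BIID would use.

```python
# Hypothetical BIID-style guard. The keyword stubs below are placeholders
# for LLM-based intent judges; they are assumptions, not the paper's method.

FORWARD_RISK_TERMS = {"bypass the filter", "weapon", "exploit"}
BACKWARD_RISK_TERMS = {"step-by-step synthesis", "payload"}

def forward_intent_risky(dialogue):
    """Forward pass: infer intent from all user turns seen so far,
    so malicious goals split across turns are judged jointly."""
    joined = " ".join(text for role, text in dialogue if role == "user").lower()
    return any(term in joined for term in FORWARD_RISK_TERMS)

def backward_intent_risky(candidate_response):
    """Backward pass: retrospect on the draft response itself,
    catching harm the request-side check missed."""
    text = candidate_response.lower()
    return any(term in text for term in BACKWARD_RISK_TERMS)

def guarded_reply(dialogue, generate):
    """Only emit a reply that passes both directional checks."""
    if forward_intent_risky(dialogue):
        return "[refused: forward intent check]"
    draft = generate(dialogue)
    if backward_intent_risky(draft):
        return "[refused: backward retrospection check]"
    return draft
```

The design point is the synergy: the forward check alone misses harm that only materializes in the model's output, while the backward check alone sees each response without the conversation-level intent; running both closes the gap multi-turn attacks exploit.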

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, black_box
Datasets
AdvBench, HarmBench
Applications
conversational ai, llm safety, guardrails