
Bidirectional Intention Inference Enhances LLMs' Defense Against Multi-Turn Jailbreak Attacks

Haibo Tong 1,2, Dongcheng Zhao 3,4,5,1,2, Guobin Shen 1,2, Xiang He 1,2, Dachuan Lin 5, Feifei Zhao 3,4,5,1,2, Yi Zeng 3,4,5,1,2

Published on arXiv · arXiv:2509.22732 · 1 citation · 39 references

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

BIID significantly reduces Attack Success Rate (ASR) across both single-turn and multi-turn jailbreak attempts, outperforming all seven baseline defense methods while maintaining model utility.

BIID (Bidirectional Intention Inference Defense)

Novel technique introduced


The remarkable capabilities of Large Language Models (LLMs) have raised significant safety concerns, particularly regarding "jailbreak" attacks that exploit adversarial prompts to bypass safety alignment mechanisms. Existing defense research primarily focuses on single-turn attacks, whereas multi-turn jailbreak attacks progressively break through safeguards by concealing malicious intent and applying tactical manipulation, ultimately rendering conventional single-turn defenses ineffective. To address this critical challenge, we propose the Bidirectional Intention Inference Defense (BIID). The method integrates forward request-based intention inference with backward response-based intention retrospection, establishing a bidirectional synergy mechanism that detects risks concealed within seemingly benign inputs and thereby constructs a more robust guardrail against harmful content generation. The proposed method is systematically evaluated against a no-defense baseline and seven representative defense methods across three LLMs and two safety benchmarks under 10 different attack methods. Experimental results demonstrate that BIID significantly reduces the Attack Success Rate (ASR) across both single-turn and multi-turn jailbreak attempts, outperforming all baseline methods while effectively maintaining practical utility. Notably, comparative experiments across three multi-turn safety datasets further validate the proposed method's significant advantages over other defense approaches.


Key Contributions

  • Proposes BIID, integrating forward request-based intention inference with backward response-based intention retrospection to detect malicious intent hidden across multi-turn conversations
  • Demonstrates that existing single-turn defenses degrade significantly under multi-turn jailbreak attacks, establishing a systematic benchmark of this gap
  • Outperforms seven representative defense baselines across three LLMs, ten attack methods, and five safety datasets on both single-turn and multi-turn attack scenarios
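The bidirectional structure described above can be sketched as a two-stage guard around generation: a forward pass that infers intent from the accumulated user requests, and a backward pass that retrospects on what a draft response would actually reveal. The sketch below is hypothetical and is not the paper's implementation; the two keyword-stub classifiers stand in for the LLM-based intent judges BIID would use.

```python
# Hypothetical BIID-style guard. The keyword stubs below are placeholders
# for LLM-based intent judges; they are assumptions, not the paper's method.

FORWARD_RISK_TERMS = {"bypass the filter", "weapon", "exploit"}
BACKWARD_RISK_TERMS = {"step-by-step synthesis", "payload"}

def forward_intent_risky(dialogue):
    """Forward pass: infer intent from all user turns seen so far,
    so malicious goals split across turns are judged jointly."""
    joined = " ".join(text for role, text in dialogue if role == "user").lower()
    return any(term in joined for term in FORWARD_RISK_TERMS)

def backward_intent_risky(candidate_response):
    """Backward pass: retrospect on the draft response itself,
    catching harm the request-side check missed."""
    text = candidate_response.lower()
    return any(term in text for term in BACKWARD_RISK_TERMS)

def guarded_reply(dialogue, generate):
    """Only emit a reply that passes both directional checks."""
    if forward_intent_risky(dialogue):
        return "[refused: forward intent check]"
    draft = generate(dialogue)
    if backward_intent_risky(draft):
        return "[refused: backward retrospection check]"
    return draft
```

The design point is the synergy: the forward check alone misses harm that only materializes in the model's output, while the backward check alone sees each response without the conversation-level intent; running both closes the gap multi-turn attacks exploit.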

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, black_box
Datasets
AdvBench, HarmBench
Applications
conversational ai, llm safety, guardrails