attack 2026

Turn-Based Structural Triggers: Prompt-Free Backdoors in Multi-Turn LLMs

Yiyang Lu 1,2, Jinwen He 1,2, Yue Zhao 1,2, Kai Chen 1,2, Ruigang Liang 1,2

0 citations · 21 references · arXiv

α

Published on arXiv

2601.14340

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

TST achieves 99.52% average attack success rate across four LLM families with only 1,800 poisoned dialogues and remains effective at 98.04% ASR against five representative defenses

Turn-based Structural Trigger (TST)

Novel technique introduced


Large Language Models (LLMs) are widely integrated into interactive systems such as dialogue agents and task-oriented assistants. This growing ecosystem also raises supply-chain risks, where adversaries can distribute poisoned models that degrade downstream reliability and user trust. Existing backdoor attacks and defenses are largely prompt-centric, focusing on user-visible triggers while overlooking structural signals in multi-turn conversations. We propose Turn-based Structural Trigger (TST), a backdoor attack that activates from dialogue structure, using the turn index as the trigger and remaining independent of user inputs. Across four widely used open-source LLM models, TST achieves an average attack success rate (ASR) of 99.52% with minimal utility degradation, and remains effective under five representative defenses with an average ASR of 98.04%. The attack also generalizes well across instruction datasets, maintaining an average ASR of 99.19%. Our results suggest that dialogue structure constitutes an important and under-studied attack surface for multi-turn LLM systems, motivating structure-aware auditing and mitigation in practice.


Key Contributions

  • Identifies dialogue-structure signals (turn position, role tags, formatting) as a new and unexplored backdoor trigger channel in multi-turn LLM systems
  • Proposes Turn-based Structural Trigger (TST), a prompt-free backdoor that activates deterministically at a specified conversation turn without requiring any user-visible trigger
  • Demonstrates TST achieves 99.52% ASR across four LLM families with ~96.47% utility retention and 98.04% ASR under five representative defenses, exposing the inadequacy of prompt-centric defenses

🛡️ Threat Analysis

Model Poisoning

TST is a backdoor/trojan attack that embeds hidden targeted malicious behavior (advertisements, harmful content) in LLMs, activated only when conversation reaches a specific turn index while behaving normally otherwise — a classic backdoor with a novel structural trigger.


Details

Domains
nlp
Model Types
llmtransformer
Threat Tags
training_timetargeteddigital
Datasets
UltraChatChatAlpaca-20K
Applications
dialogue agentstask-oriented assistantsmulti-turn llm systems