
Asking Forever: Universal Activations Behind Turn Amplification in Conversational LLMs

Zachary Coalson 1, Bo Fang 2, Sanghyun Hong 1

0 citations · 60 references · arXiv (Cornell University)

Published on arXiv

2602.17778

Model Poisoning

OWASP ML Top 10 — ML10

Model Denial of Service

OWASP LLM Top 10 — LLM04

Key Finding

Fine-tuning and parameter corruption attacks substantially increase multi-turn interaction counts across instruction-tuned LLMs while remaining task-compliant, with existing defenses offering only limited protection.

Turn Amplification

Novel technique introduced


Multi-turn interaction length is a dominant factor in the operational costs of conversational LLMs. In this work, we present a new failure mode in conversational LLMs: turn amplification, in which a model consistently prolongs multi-turn interactions without completing the underlying task. We show that an adversary can systematically exploit clarification-seeking behavior, commonly encouraged in multi-turn conversation settings, to scalably prolong interactions. Moving beyond prompt-level behaviors, we take a mechanistic perspective and identify a query-independent, universal activation subspace associated with clarification-seeking responses. Unlike prior cost-amplification attacks that rely on per-turn prompt optimization, our attack arises from conversational dynamics and persists across prompts and tasks. We show that this mechanism provides a scalable pathway to induce turn amplification: both supply-chain attacks via fine-tuning and runtime attacks through low-level parameter corruptions consistently shift models toward abstract, clarification-seeking behavior across prompts. Across multiple instruction-tuned LLMs and benchmarks, our attack substantially increases turn count while remaining task-compliant. We also show that existing defenses offer limited protection against this emerging class of failures.
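The "universal activation subspace" in the abstract can be pictured with a common mechanistic-interpretability primitive: a difference-of-means direction between hidden activations collected on clarification-seeking versus direct-answer responses. The sketch below uses synthetic NumPy arrays as stand-ins for real model activations; the dimensions, sample counts, and the assumption that a single linear direction suffices are illustrative, not the paper's exact method.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hidden dimension (illustrative; real LLM hidden sizes are far larger)

# Synthetic stand-ins for hidden activations gathered at the final token of
# clarification-seeking vs. direct-answer responses across many prompts.
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
clarify_acts = rng.normal(size=(200, d)) + 3.0 * true_dir  # shifted along a shared direction
direct_acts = rng.normal(size=(200, d))

# Difference-of-means direction: a 1-D "subspace" separating the two behaviors.
direction = clarify_acts.mean(axis=0) - direct_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

# Projections onto this direction separate the two response classes, which is
# what makes such a direction query-independent in the paper's sense.
proj_clarify = clarify_acts @ direction
proj_direct = direct_acts @ direction
print(proj_clarify.mean() > proj_direct.mean())
```

Because the shift is shared across all samples rather than tied to any one prompt, the recovered direction aligns closely with the planted one; an attacker who can edit weights could then amplify activations along such a direction.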


Key Contributions

  • Identifies 'turn amplification' as a novel failure mode in conversational LLMs in which adversaries exploit clarification-seeking dynamics to scalably inflate multi-turn operational costs
  • Mechanistically identifies a query-independent, universal activation subspace associated with clarification-seeking responses that persists across prompts and tasks
  • Demonstrates two attack vectors — supply-chain attacks via fine-tuning and runtime attacks via low-level parameter corruption — that persistently induce turn amplification while maintaining apparent compliance

🛡️ Threat Analysis

Model Poisoning

The paper demonstrates supply-chain attacks via fine-tuning and runtime parameter-corruption attacks that embed persistent clarification-seeking behavior directly in model weights. Although the induced behavior is always-on rather than trigger-activated, the mechanism is direct weight/parameter manipulation to implant targeted malicious behavior, not prompt-level exploitation.
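Since the paper reports that existing defenses offer only limited protection, one naive deployment-side mitigation is a per-conversation budget on clarification turns. The sketch below is an illustrative assumption, not a defense from the paper: the lexical classifier and threshold are crude stand-ins, and a weight-level attack could likely rephrase around them.

```python
# Crude lexical markers standing in for a real clarification-turn classifier.
CLARIFY_MARKERS = ("could you clarify", "can you specify", "what do you mean")

def is_clarification(response: str) -> bool:
    """Heuristic stand-in: a question containing a known clarification phrase."""
    text = response.lower().rstrip()
    return text.endswith("?") and any(m in text for m in CLARIFY_MARKERS)

class TurnBudget:
    """Flags conversations whose clarification turns exceed a fixed budget."""

    def __init__(self, max_clarifications: int = 3):
        self.max = max_clarifications
        self.count = 0

    def observe(self, response: str) -> bool:
        """Record one model turn; return True once the budget is exhausted."""
        if is_clarification(response):
            self.count += 1
        return self.count > self.max
```

Usage: `budget = TurnBudget(max_clarifications=2)`; calling `budget.observe(...)` on each model turn returns `True` on the third clarification, at which point the operator could terminate or escalate the conversation.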


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
white_box · grey_box · training_time · inference_time · targeted
Applications
conversational ai · instruction-tuned llms · multi-turn dialogue systems