
Text Adversarial Attacks with Dynamic Outputs

Wenqiang Wang, Siyuan Liang, Xiao Yan, Xiaochun Cao

0 citations · 67 references · arXiv


Published on arXiv (2509.22393)

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

TDOA achieves a 50.81% attack success rate (ASR) on ChatGPT-4.1 with a single query in dynamic-output settings, and an 82.68% ASR in conventional static-output scenarios, outperforming prior transfer-based text attack methods.

TDOA (Textual Dynamic Outputs Attack)

Novel technique introduced


Text adversarial attack methods are typically designed for static scenarios with a fixed number of output labels and a predefined label space, and they rely on extensive querying of the victim model (query-based attacks) or a surrogate model (transfer-based attacks). However, real-world systems such as multi-label classifiers and large language models produce dynamic outputs, where the number of labels varies or generated labels fall outside any predefined space. To address this gap, we introduce the Textual Dynamic Outputs Attack (TDOA) method, which employs a clustering-based surrogate model training approach to convert the dynamic-output scenario into a static single-output scenario. To improve attack effectiveness, we propose the farthest-label targeted attack strategy, which selects adversarial vectors that deviate most from the model's coarse-grained labels, thereby maximizing disruption. We extensively evaluate TDOA on four datasets and eight victim models (e.g., ChatGPT-4o, ChatGPT-4.1), showing its effectiveness in crafting adversarial examples and its strong potential to compromise large language models with limited access. With a single query per text, TDOA achieves a maximum attack success rate of 50.81%. Additionally, we find that TDOA also achieves state-of-the-art performance in conventional static-output scenarios, reaching a maximum ASR of 82.68%. Meanwhile, by conceptualizing translation tasks as classification problems with unbounded output spaces, we extend the TDOA framework to generative settings, surpassing prior results by up to 0.64 RDBLEU and 0.62 RDchrF.


Key Contributions

  • Introduces the Dynamic Outputs (DO) scenario for adversarial attacks, covering both multi-label classifiers (variable label count) and LLMs (out-of-space label generation)
  • Proposes TDOA, which uses clustering-based surrogate model training to convert DO into a static single-label task, enabling standard transfer-based attack methods to apply
  • Introduces a farthest-label targeted attack strategy that selects adversarial perturbations maximally deviating from coarse-grained predicted labels, achieving up to 50.81% ASR on ChatGPT-4.1 with a single query
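The clustering step in the second contribution can be illustrated with a minimal sketch: embed each dynamic output observed from the victim model, cluster the embeddings into a fixed number of coarse-grained labels, and pair those labels with the original inputs to form a static single-label training set for the surrogate. The embedding dimensions, cluster count, and the from-scratch k-means below are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: collapsing a dynamic label space into k coarse-grained cluster
# labels so a standard single-label surrogate can be trained.
import numpy as np

def kmeans(vectors, k, iters=20, seed=0):
    """Minimal k-means; returns a cluster index for each vector."""
    rng = np.random.default_rng(seed)
    centers = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(iters):
        # Distance from every vector to every center, then nearest assignment.
        dists = np.linalg.norm(vectors[:, None] - centers[None], axis=2)
        assign = dists.argmin(axis=1)
        for c in range(k):
            members = vectors[assign == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    return assign

# Dynamic outputs: each query may return a different label set or free text.
# Stand-in embeddings of 200 observed outputs (a real attack would embed
# the victim model's actual responses).
outputs_emb = np.random.default_rng(1).normal(size=(200, 32))
coarse_labels = kmeans(outputs_emb, k=5)
# (input_text_i, coarse_labels[i]) pairs now form a static single-label
# dataset on which a conventional surrogate classifier can be trained.
```

Once the surrogate is trained on these coarse labels, any standard transfer-based attack method applies unchanged, which is the point of the reduction.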

🛡️ Threat Analysis

Input Manipulation Attack

TDOA crafts adversarial text inputs (word-level perturbations) to cause misclassification at inference time across text classifiers and LLM-based classifiers, extending the classical adversarial example paradigm to dynamic-output scenarios. The attack uses surrogate model training, importance scoring, and targeted perturbation strategies to maximally disrupt model predictions, which is the core ML01 threat model.
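The importance-scoring and farthest-label steps described above can be sketched as follows: score each word by how much deleting it reduces the surrogate's confidence in the current prediction, pick the coarse label with the lowest current score as the farthest target, and perturb the highest-importance words first. The linear surrogate, toy embedding, and scoring here are illustrative assumptions standing in for the paper's trained surrogate model.

```python
# Sketch: word-importance scoring with a farthest-label target on a
# stand-in surrogate (32-dim text embedding -> 5 coarse labels).
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 5))  # toy linear surrogate weights

def embed(words):
    """Toy deterministic bag-of-hashes embedding (illustrative only)."""
    v = np.zeros(32)
    for w in words:
        v[sum(map(ord, w)) % 32] += 1.0
    return v

def logits(words):
    return embed(words) @ W

def farthest_label(words):
    """Target the coarse label scored lowest now, i.e. maximal deviation."""
    return int(np.argmin(logits(words)))

def importance(words):
    """Score each word by the confidence drop its deletion causes."""
    base = logits(words)
    pred = int(np.argmax(base))
    scores = []
    for i in range(len(words)):
        reduced = words[:i] + words[i + 1:]
        scores.append(base[pred] - logits(reduced)[pred])
    return scores

words = "the service was quick and friendly".split()
target = farthest_label(words)
order = np.argsort(importance(words))[::-1]  # perturb most important words first
```

In the full attack, the words ranked first would be replaced with substitutions that push the surrogate's prediction toward the farthest coarse label, and the resulting text is transferred to the victim model in a single query.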


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time, targeted
Datasets
four unspecified NLP classification datasets (paper body truncated)
Applications
text classification, multi-label classification, machine translation, llm-based classification