
TAO-Attack: Toward Advanced Optimization-Based Jailbreak Attacks for Large Language Models

Zhi Xu, Jiaqi Li, Xiaotong Zhang, Hong Yu, Han Liu


Published on arXiv

2603.03081

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

TAO-Attack achieves up to a 100% attack success rate on certain LLMs, consistently outperforming GCG, MAC, and I-GCG by suppressing refusals and eliminating pseudo-harmful outputs via a two-stage loss.

TAO-Attack (DPTO)

Novel technique introduced


Large language models (LLMs) have achieved remarkable success across diverse applications but remain vulnerable to jailbreak attacks, where attackers craft prompts that bypass safety alignment and elicit unsafe responses. Among existing approaches, optimization-based attacks have shown strong effectiveness, yet current methods often suffer from frequent refusals, pseudo-harmful outputs, and inefficient token-level updates. In this work, we propose TAO-Attack, a new optimization-based jailbreak method. TAO-Attack employs a two-stage loss function: the first stage suppresses refusals to ensure the model continues harmful prefixes, while the second stage penalizes pseudo-harmful outputs and encourages the model toward more harmful completions. In addition, we design a direction-priority token optimization (DPTO) strategy that improves efficiency by aligning candidates with the gradient direction before considering update magnitude. Extensive experiments on multiple LLMs demonstrate that TAO-Attack consistently outperforms state-of-the-art methods, achieving higher attack success rates and even reaching 100% in certain scenarios.
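The two-stage objective described above can be sketched as a staged loss schedule. This is a minimal illustrative sketch, not the paper's formulation: the function names, the specific refusal/disclaimer log-probability terms, the `beta` weight, and the step-based switching rule are all assumptions.

```python
def stage_one_loss(lp_target, lp_refusal):
    # Stage 1: maximize the log-probability of the harmful target prefix
    # while suppressing refusal continuations ("I cannot help with that").
    # Lower loss = more likely target prefix, less likely refusal.
    return -lp_target + lp_refusal


def stage_two_loss(lp_target, lp_disclaimer, beta=1.0):
    # Stage 2: additionally penalize pseudo-harmful outputs, i.e.
    # completions that begin with the harmful prefix but pivot into a
    # safety disclaimer instead of genuinely harmful content.
    return -lp_target + beta * lp_disclaimer


def tao_loss(step, switch_step, lp_target, lp_refusal, lp_disclaimer):
    # Hypothetical schedule: run stage 1 until refusals are suppressed,
    # then switch to stage 2 (a fixed step threshold stands in for
    # whatever switching criterion the paper actually uses).
    if step < switch_step:
        return stage_one_loss(lp_target, lp_refusal)
    return stage_two_loss(lp_target, lp_disclaimer)
```

In this sketch the optimizer minimizes `tao_loss` over adversarial suffix tokens; only which undesired behavior is penalized (refusal vs. disclaimer) changes between stages.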


Key Contributions

  • Two-stage loss function: stage one suppresses refusals by forcing harmful prefix continuation; stage two penalizes pseudo-harmful outputs (safety disclaimers) to enforce genuinely harmful completions
  • Direction-Priority Token Optimization (DPTO) strategy that filters candidates by gradient direction alignment before ranking by update magnitude, reducing unstable token updates
  • Empirical outperformance of GCG, MAC, and I-GCG on multiple open- and closed-source LLMs, including 100% attack success rate on certain models
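The DPTO contribution above (filter by gradient-direction alignment first, then rank by update magnitude) can be sketched as a candidate-selection routine. All names and the exact alignment and ranking criteria here are assumptions for illustration; the paper's concrete scoring may differ.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def dpto_select(grad, candidate_deltas, top_k=2):
    """Select candidate token substitutions, direction first, magnitude second.

    grad: gradient of the loss w.r.t. the current token's embedding.
    candidate_deltas: embedding change each candidate swap would cause.
    """
    neg_grad = [-g for g in grad]
    # Step 1 (direction priority): keep only candidates whose update
    # points along the descent direction, i.e. delta . (-grad) > 0,
    # discarding unstable updates that fight the gradient.
    aligned = [i for i, d in enumerate(candidate_deltas) if dot(d, neg_grad) > 0]
    # Step 2: among aligned candidates, rank by update magnitude and
    # keep the top_k for evaluation.
    aligned.sort(key=lambda i: norm(candidate_deltas[i]), reverse=True)
    return aligned[:top_k]
```

For example, with `grad = [1.0, 0.0]` and candidate deltas `[[-2.0, 0.0], [-0.5, 0.0], [1.0, 0.0]]`, the third candidate is filtered out (it points against the descent direction) and the remaining two are returned largest-magnitude first.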

🛡️ Threat Analysis

Input Manipulation Attack

TAO-Attack uses gradient-based token-level optimization (GCG-style) to craft adversarial suffixes that bypass LLM safety alignment — this is adversarial suffix optimization via gradient signal, squarely ML01.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time, targeted
Datasets
AdvBench
Applications
llm safety alignment bypass, jailbreak attacks, red-teaming llms