PivotAttack: Rethinking the Search Trajectory in Hard-Label Text Attacks via Pivot Words
Yuzhi Liang, Shiliang Xiao, Jingsong Wei, Qiliang Lin, Xia Li
Published on arXiv
2603.10842
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
PivotAttack consistently outperforms state-of-the-art hard-label black-box text attacks in both Attack Success Rate and query efficiency across traditional models and LLMs, with strong results even against robust fine-tuned LLMs.
PivotAttack
Novel technique introduced
Existing hard-label text attacks often rely on inefficient "outside-in" strategies that traverse vast search spaces. We propose PivotAttack, a query-efficient "inside-out" framework. It employs a Multi-Armed Bandit algorithm to identify Pivot Sets (combinatorial token groups that act as prediction anchors) and strategically perturbs them to induce label flips. This approach captures inter-word dependencies and minimizes query costs. Extensive experiments across traditional models and Large Language Models demonstrate that PivotAttack consistently outperforms state-of-the-art baselines in both Attack Success Rate and query efficiency.
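The "inside-out" perturbation step can be sketched as follows. This is a minimal illustration, not the paper's implementation: `classify` (the hard-label black box) and `synonyms` (a substitution-candidate map) are hypothetical names, and the exhaustive product over substitutions stands in for whatever search order PivotAttack actually uses.

```python
import itertools

def try_flip(tokens, pivot_set, orig_label, classify, synonyms, max_queries=50):
    """Perturb only the pivot tokens and stop at the first label flip.

    `classify` returns a label only (hard-label setting); `pivot_set` is a
    list of token indices identified as a prediction anchor.
    """
    queries = 0
    # Substitution candidates per pivot token; a token with no synonyms
    # keeps its original form.
    candidate_lists = [synonyms.get(tokens[i], [tokens[i]]) for i in pivot_set]
    # Combinatorial group: jointly vary all pivot tokens, capturing
    # inter-word dependencies instead of scoring tokens independently.
    for combo in itertools.product(*candidate_lists):
        adv = list(tokens)
        for idx, sub in zip(pivot_set, combo):
            adv[idx] = sub
        queries += 1
        if queries > max_queries:
            break
        if classify(adv) != orig_label:  # hard-label feedback only
            return adv, queries
    return None, queries
```

Because only the pivot tokens are varied, the search space shrinks from all token positions to the small anchor group, which is where the query savings over "outside-in" refinement come from.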
Key Contributions
- Novel "inside-out" attack strategy that identifies Pivot Sets (critical multi-token anchors) and perturbs them to efficiently cross the decision boundary, avoiding the query-expensive "outside-in" refinement of prior work
- Formulation of Pivot Set identification as a Multi-Armed Bandit (KL-LUCB) problem, capturing inter-word dependencies rather than scoring tokens independently
- Demonstrated effectiveness against both traditional NLP models and fine-tuned/zero-shot LLMs, outperforming SOTA baselines in attack success rate and query efficiency
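The KL-LUCB formulation mentioned above rests on KL-based confidence bounds for Bernoulli rewards (here, whether perturbing a candidate token group flips the label). A minimal sketch of those bounds, with the inversion done by bisection; the exploration budget `beta` and all function names are illustrative, not taken from the paper:

```python
import math

def bern_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb(p_hat, n, beta):
    """Largest q >= p_hat with n * KL(p_hat, q) <= beta (bisection)."""
    lo, hi = p_hat, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if n * bern_kl(p_hat, mid) <= beta:
            lo = mid
        else:
            hi = mid
    return lo

def kl_lcb(p_hat, n, beta):
    """Smallest q <= p_hat with n * KL(p_hat, q) <= beta (bisection)."""
    lo, hi = 0.0, p_hat
    for _ in range(50):
        mid = (lo + hi) / 2
        if n * bern_kl(p_hat, mid) <= beta:
            hi = mid
        else:
            lo = mid
    return hi
```

In a LUCB-style loop, each arm is a candidate token group; sampling an arm means querying the model with that group perturbed, and the loop stops once the lower bound of the best empirical arm exceeds the upper bound of every challenger, certifying the Pivot Set with few queries.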
🛡️ Threat Analysis
Proposes a novel adversarial example attack on NLP text classifiers in the hard-label black-box setting: the attacker crafts word substitutions that cause misclassification at inference time, directly targeting model input integrity across both traditional NLP models and LLMs used as classifiers.