PivotAttack: Rethinking the Search Trajectory in Hard-Label Text Attacks via Pivot Words
Yuzhi Liang, Shiliang Xiao, Jingsong Wei, Qiliang Lin, Xia Li
Published on arXiv
2603.10842
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
PivotAttack consistently outperforms state-of-the-art hard-label black-box text attacks in both Attack Success Rate and query efficiency across traditional models and LLMs, with strong results even against robust fine-tuned LLMs.
PivotAttack
Novel technique introduced
Existing hard-label text attacks often rely on inefficient "outside-in" strategies that traverse vast search spaces. We propose PivotAttack, a query-efficient "inside-out" framework. It employs a Multi-Armed Bandit algorithm to identify Pivot Sets (combinatorial token groups that act as prediction anchors) and strategically perturbs them to induce label flips. This approach captures inter-word dependencies and minimizes query costs. Extensive experiments across traditional models and Large Language Models demonstrate that PivotAttack consistently outperforms state-of-the-art baselines in both Attack Success Rate and query efficiency.
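The "inside-out" perturbation step can be sketched as follows. This is a minimal illustration, not the paper's implementation: `classify` (the hard-label black box) and `synonyms` (a substitution-candidate map) are hypothetical names, and the exhaustive product over substitutions stands in for whatever search order PivotAttack actually uses.

```python
import itertools

def try_flip(tokens, pivot_set, orig_label, classify, synonyms, max_queries=50):
    """Perturb only the pivot tokens and stop at the first label flip.

    `classify` returns a label only (hard-label setting); `pivot_set` is a
    list of token indices identified as a prediction anchor.
    """
    queries = 0
    # Substitution candidates per pivot token; a token with no synonyms
    # keeps its original form.
    candidate_lists = [synonyms.get(tokens[i], [tokens[i]]) for i in pivot_set]
    # Combinatorial group: jointly vary all pivot tokens, capturing
    # inter-word dependencies instead of scoring tokens independently.
    for combo in itertools.product(*candidate_lists):
        adv = list(tokens)
        for idx, sub in zip(pivot_set, combo):
            adv[idx] = sub
        queries += 1
        if queries > max_queries:
            break
        if classify(adv) != orig_label:  # hard-label feedback only
            return adv, queries
    return None, queries
```

Because only the pivot tokens are varied, the search space shrinks from all token positions to the small anchor group, which is where the query savings over "outside-in" refinement come from.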
Key Contributions
- Novel "inside-out" attack strategy that identifies Pivot Sets (critical multi-token anchors) and perturbs them to efficiently cross the decision boundary, avoiding the query-expensive "outside-in" refinement of prior work
- Formulation of Pivot Set identification as a Multi-Armed Bandit (KL-LUCB) problem, capturing inter-word dependencies rather than scoring tokens independently
- Demonstrated effectiveness against both traditional NLP models and fine-tuned/zero-shot LLMs, outperforming SOTA baselines in attack success rate and query efficiency
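The KL-LUCB formulation mentioned above rests on KL-based confidence bounds for Bernoulli rewards (here, whether perturbing a candidate token group flips the label). A minimal sketch of those bounds, with the inversion done by bisection; the exploration budget `beta` and all function names are illustrative, not taken from the paper:

```python
import math

def bern_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb(p_hat, n, beta):
    """Largest q >= p_hat with n * KL(p_hat, q) <= beta (bisection)."""
    lo, hi = p_hat, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if n * bern_kl(p_hat, mid) <= beta:
            lo = mid
        else:
            hi = mid
    return lo

def kl_lcb(p_hat, n, beta):
    """Smallest q <= p_hat with n * KL(p_hat, q) <= beta (bisection)."""
    lo, hi = 0.0, p_hat
    for _ in range(50):
        mid = (lo + hi) / 2
        if n * bern_kl(p_hat, mid) <= beta:
            hi = mid
        else:
            lo = mid
    return hi
```

In a LUCB-style loop, each arm is a candidate token group; sampling an arm means querying the model with that group perturbed, and the loop stops once the lower bound of the best empirical arm exceeds the upper bound of every challenger, certifying the Pivot Set with few queries.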
🛡️ Threat Analysis
Proposes a novel adversarial example attack on NLP text classifiers in the hard-label black-box setting: the attacker crafts word substitutions that cause misclassification at inference time, directly targeting model input integrity across both traditional NLP models and LLMs used as classifiers.