Tougher Text, Smarter Models: Raising the Bar for Adversarial Defence Benchmarks

Yang Wang 1,2, Chenghua Lin 1,3

4 citations · 79 references · COLING

Published on arXiv · 2501.02654

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

TTSO++ demonstrates strong robustness against TextFooler and TextBugger attacks while being task-agnostic and applicable across diverse NLP benchmarks beyond text classification.

TTSO++

Novel technique introduced


Recent advancements in natural language processing have highlighted the vulnerability of deep learning models to adversarial attacks. While various defence mechanisms have been proposed, there is a lack of comprehensive benchmarks that evaluate these defences across diverse datasets, models, and tasks. In this work, we address this gap by presenting an extensive benchmark for textual adversarial defence that significantly expands upon previous work. Our benchmark incorporates a wide range of datasets, evaluates state-of-the-art defence mechanisms, and extends the assessment to include critical tasks such as single-sentence classification, similarity and paraphrase identification, natural language inference, and commonsense reasoning. This work not only serves as a valuable resource for researchers and practitioners in the field of adversarial robustness but also identifies key areas for future research in textual adversarial defence. By establishing a new standard for benchmarking in this domain, we aim to accelerate progress towards more robust and reliable natural language processing systems.


Key Contributions

  • Expands the textual adversarial defence benchmark beyond text classification to include similarity/paraphrase identification, natural language inference, and commonsense reasoning tasks with more datasets, models, and recent defences
  • Proposes TTSO++, a variant of training-time temperature scaling that incorporates a dynamic entropy term for confidence adjustment, improving robustness against TextFooler and TextBugger
  • Identifies key gaps and future directions in synonym-agnostic, structure-free adversarial defence for NLP
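The summary describes TTSO++ only at a high level: training-time temperature scaling extended with a dynamic entropy term for confidence adjustment. The paper's exact formulation is not reproduced here, so the following is a minimal sketch of what an entropy-adjusted, temperature-scaled training loss could look like; the function names, `base_T`, and the additive `alpha * entropy` schedule are illustrative assumptions, not the authors' method.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; larger T flattens the distribution."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    """Shannon entropy of a probability vector (nats)."""
    p = np.clip(p, 1e-12, 1.0)
    return float(-np.sum(p * np.log(p)))

def ttso_like_loss(logits, label, base_T=2.0, alpha=0.5):
    """Hypothetical TTSO++-style loss: cross-entropy under a temperature
    that grows with the prediction's entropy (uncertain inputs get
    softened harder). The schedule T = base_T + alpha * H is an assumption."""
    p = softmax(logits)            # unscaled prediction
    T = base_T + alpha * entropy(p)  # dynamic, entropy-dependent temperature
    p_scaled = softmax(logits, T)
    return float(-np.log(p_scaled[label]))
```

The intuition, consistent with temperature scaling generally, is that raising the temperature during training discourages overconfident logit margins, which word-substitution attacks like TextFooler exploit; tying the temperature to prediction entropy lets the smoothing adapt per example.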

🛡️ Threat Analysis

Input Manipulation Attack

Evaluates defences against word-substitution adversarial attacks (TextFooler, TextBugger) that cause misclassification at inference time on NLP classifiers — classic input manipulation/evasion attacks on transformer models.
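TextFooler and TextBugger are black-box attacks that greedily replace words with semantically close substitutes until the victim model's prediction flips. A toy illustration of that loop, using a tiny hand-written synonym table and a keyword stand-in for a real classifier (both are assumptions for demonstration, not the actual attack implementations):

```python
# Toy greedy word-substitution attack in the spirit of TextFooler/TextBugger.
# SYNONYMS and toy_classifier are illustrative stand-ins, not real resources.
SYNONYMS = {
    "great": ["fine", "decent"],
    "terrible": ["awful", "poor"],
}

def toy_classifier(text):
    """Keyword-count stand-in for a transformer sentiment classifier."""
    pos = {"great", "good", "excellent"}
    neg = {"terrible", "bad", "awful"}
    words = text.split()
    score = sum(w in pos for w in words) - sum(w in neg for w in words)
    return "positive" if score > 0 else "negative"

def greedy_substitute(text, clf):
    """Swap one word at a time for a synonym; return the first adversarial
    variant that changes the classifier's label, or None if none is found."""
    original_label = clf(text)
    words = text.split()
    for i, w in enumerate(words):
        for syn in SYNONYMS.get(w, []):
            candidate = words.copy()
            candidate[i] = syn
            adv = " ".join(candidate)
            if clf(adv) != original_label:
                return adv
    return None
```

Real attacks add constraints the toy version omits: substitutes are drawn from embedding neighbourhoods, ranked by word importance, and filtered for semantic similarity and grammaticality, which is what makes defending against them non-trivial.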


Details

Domains
nlp
Model Types
transformer
Threat Tags
inference_time · black_box
Datasets
SST-2 · MNLI · QQP · AdvGLUE
Applications
text classification · natural language inference · paraphrase identification · commonsense reasoning