Tougher Text, Smarter Models: Raising the Bar for Adversarial Defence Benchmarks
Yang Wang 1,2, Chenghua Lin 1,3
Published on arXiv (arXiv:2501.02654)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
TTSO++ demonstrates strong robustness against TextFooler and TextBugger attacks while remaining task-agnostic and applicable across diverse NLP benchmarks beyond text classification.
TTSO++
Novel technique introduced
Recent advancements in natural language processing have highlighted the vulnerability of deep learning models to adversarial attacks. While various defence mechanisms have been proposed, there is a lack of comprehensive benchmarks that evaluate these defences across diverse datasets, models, and tasks. In this work, we address this gap by presenting an extensive benchmark for textual adversarial defence that significantly expands upon previous work. Our benchmark incorporates a wide range of datasets, evaluates state-of-the-art defence mechanisms, and extends the assessment to include critical tasks such as single-sentence classification, similarity and paraphrase identification, natural language inference, and commonsense reasoning. This work not only serves as a valuable resource for researchers and practitioners in the field of adversarial robustness but also identifies key areas for future research in textual adversarial defence. By establishing a new standard for benchmarking in this domain, we aim to accelerate progress towards more robust and reliable natural language processing systems.
Key Contributions
- Expands the textual adversarial defence benchmark beyond text classification to include similarity/paraphrase identification, natural language inference, and commonsense reasoning tasks with more datasets, models, and recent defences
- Proposes TTSO++, a variant of training-time temperature scaling that incorporates a dynamic entropy term for confidence adjustment, improving robustness against TextFooler and TextBugger
- Identifies key gaps and future directions in synonym-agnostic, structure-free adversarial defence for NLP
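The summary describes TTSO++ as training-time temperature scaling augmented with a dynamic entropy term for confidence adjustment. The exact formulation is not given here, so the following is only a minimal sketch under the assumption that the loss combines temperature-scaled cross-entropy with an entropy-based confidence penalty; the function names (`ttso_like_loss`, `beta`) are hypothetical, not from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax: a larger temperature flattens the
    # distribution, softening the model's confidence during training.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy of a predictive distribution (in nats).
    return -sum(p * math.log(p) for p in probs if p > 0)

def ttso_like_loss(logits, target, temperature=2.0, beta=0.1):
    # Hypothetical TTSO++-style objective: temperature-scaled
    # cross-entropy minus a weighted entropy bonus, which discourages
    # over-confident predictions. The paper's actual dynamic entropy
    # term may be weighted or scheduled differently.
    probs = softmax(logits, temperature)
    ce = -math.log(probs[target])
    return ce - beta * entropy(probs)
```

The intuition is that over-confident (low-entropy) predictions are easier to flip with small input perturbations, so tempering confidence at training time can harden the decision boundary.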
🛡️ Threat Analysis
Evaluates defences against word-substitution adversarial attacks (TextFooler, TextBugger) that cause misclassification at inference time on NLP classifiers — classic input manipulation/evasion attacks on transformer models.
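The attack class above can be illustrated with a deliberately naive toy classifier (this is not TextFooler or TextBugger, which use embedding-based synonym search against neural models; it only shows the evasion principle that a meaning-preserving word substitution can flip a prediction):

```python
# Toy illustration of a word-substitution evasion attack: a keyword-based
# sentiment classifier flips its label when a sentiment-bearing word is
# replaced by a synonym outside its vocabulary.
NEGATIVE_WORDS = {"terrible", "awful", "bad"}

def naive_sentiment(text: str) -> str:
    # Predict "negative" iff any known negative keyword appears.
    words = text.lower().split()
    return "negative" if any(w in NEGATIVE_WORDS for w in words) else "positive"

original = "the movie was terrible"
adversarial = "the movie was horrendous"  # synonym substitution, same meaning
```

Real attacks like TextFooler automate this search, ranking candidate substitutions by word importance and semantic similarity so the perturbed text stays fluent while the victim model's prediction changes.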