Tougher Text, Smarter Models: Raising the Bar for Adversarial Defence Benchmarks
Yang Wang 1,2, Chenghua Lin 1,3
Published on arXiv (arXiv:2501.02654)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
TTSO++ demonstrates strong robustness against TextFooler and TextBugger attacks while remaining task-agnostic and applicable across diverse NLP benchmarks beyond text classification.
TTSO++
Novel technique introduced
Recent advancements in natural language processing have highlighted the vulnerability of deep learning models to adversarial attacks. While various defence mechanisms have been proposed, there is a lack of comprehensive benchmarks that evaluate these defences across diverse datasets, models, and tasks. In this work, we address this gap by presenting an extensive benchmark for textual adversarial defence that significantly expands upon previous work. Our benchmark incorporates a wide range of datasets, evaluates state-of-the-art defence mechanisms, and extends the assessment to include critical tasks such as single-sentence classification, similarity and paraphrase identification, natural language inference, and commonsense reasoning. This work not only serves as a valuable resource for researchers and practitioners in the field of adversarial robustness but also identifies key areas for future research in textual adversarial defence. By establishing a new standard for benchmarking in this domain, we aim to accelerate progress towards more robust and reliable natural language processing systems.
Key Contributions
- Expands the textual adversarial defence benchmark beyond text classification to include similarity/paraphrase identification, natural language inference, and commonsense reasoning tasks with more datasets, models, and recent defences
- Proposes TTSO++, a variant of training-time temperature scaling that incorporates a dynamic entropy term for confidence adjustment, improving robustness against TextFooler and TextBugger
- Identifies key gaps and future directions in synonym-agnostic, structure-free adversarial defence for NLP
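The summary describes TTSO++ as training-time temperature scaling augmented with a dynamic entropy term for confidence adjustment. The exact formulation is not given here, so the following is only a minimal sketch under the assumption that the loss combines temperature-scaled cross-entropy with an entropy-based confidence penalty; the function names (`ttso_like_loss`, `beta`) are hypothetical, not from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax: a larger temperature flattens the
    # distribution, softening the model's confidence during training.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy of a predictive distribution (in nats).
    return -sum(p * math.log(p) for p in probs if p > 0)

def ttso_like_loss(logits, target, temperature=2.0, beta=0.1):
    # Hypothetical TTSO++-style objective: temperature-scaled
    # cross-entropy minus a weighted entropy bonus, which discourages
    # over-confident predictions. The paper's actual dynamic entropy
    # term may be weighted or scheduled differently.
    probs = softmax(logits, temperature)
    ce = -math.log(probs[target])
    return ce - beta * entropy(probs)
```

The intuition is that over-confident (low-entropy) predictions are easier to flip with small input perturbations, so tempering confidence at training time can harden the decision boundary.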
🛡️ Threat Analysis
Evaluates defences against word-substitution adversarial attacks (TextFooler, TextBugger) that cause misclassification at inference time on NLP classifiers — classic input manipulation/evasion attacks on transformer models.
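The attack class above can be illustrated with a deliberately naive toy classifier (this is not TextFooler or TextBugger, which use embedding-based synonym search against neural models; it only shows the evasion principle that a meaning-preserving word substitution can flip a prediction):

```python
# Toy illustration of a word-substitution evasion attack: a keyword-based
# sentiment classifier flips its label when a sentiment-bearing word is
# replaced by a synonym outside its vocabulary.
NEGATIVE_WORDS = {"terrible", "awful", "bad"}

def naive_sentiment(text: str) -> str:
    # Predict "negative" iff any known negative keyword appears.
    words = text.lower().split()
    return "negative" if any(w in NEGATIVE_WORDS for w in words) else "positive"

original = "the movie was terrible"
adversarial = "the movie was horrendous"  # synonym substitution, same meaning
```

Real attacks like TextFooler automate this search, ranking candidate substitutions by word importance and semantic similarity so the perturbed text stays fluent while the victim model's prediction changes.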