Benchmark · 2025

Differential Robustness in Transformer Language Models: Empirical Evaluation Under Adversarial Text Attacks

Taniya Gidatkar, Oluwaseun Ajao, Matthew Shardlow

0 citations


Published on arXiv: 2509.09706

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

TextFooler reduces BERT-Base accuracy from 48% to 3% (a 93.75% attack success rate), while RoBERTa-Base and Flan-T5 maintain their accuracy under the same attacks (0% attack success rate).
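The 93.75% figure is consistent with defining attack success rate as the share of originally correct predictions that the attack flips. The snippet below is only an illustrative check of that arithmetic, not code or data from the paper.

```python
# Illustrative check: attack success rate (ASR) as the fraction of
# originally-correct predictions that the adversarial attack flips.
clean_accuracy = 0.48          # BERT-Base accuracy on clean inputs (from the finding)
accuracy_under_attack = 0.03   # BERT-Base accuracy after TextFooler

asr = (clean_accuracy - accuracy_under_attack) / clean_accuracy
print(f"Attack success rate: {asr:.2%}")  # -> 93.75%
```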


This study evaluates the resilience of large language models (LLMs) against adversarial attacks, specifically focusing on Flan-T5, BERT, and RoBERTa-Base. Using systematically designed adversarial tests generated with TextFooler and BERTAttack, we found significant variations in model robustness. RoBERTa-Base and Flan-T5 demonstrated remarkable resilience, maintaining accuracy even when subjected to sophisticated attacks, with attack success rates of 0%. In contrast, BERT-Base showed considerable vulnerability, with TextFooler achieving a 93.75% success rate in reducing model accuracy from 48% to just 3%. Our research reveals that while certain LLMs have developed effective defensive mechanisms, these safeguards often require substantial computational resources. This study contributes to the understanding of LLM security by identifying existing strengths and weaknesses in current safeguarding approaches and proposes practical recommendations for developing more efficient and effective defensive strategies.


Key Contributions

  • Empirical comparison of adversarial robustness across BERT-Base, RoBERTa-Base, and Flan-T5 under TextFooler and BERTAttack
  • Identifies stark vulnerability disparity: BERT-Base suffers 93.75% attack success rate while RoBERTa-Base and Flan-T5 achieve 0% under the same attacks
  • Practical recommendations for balancing computational cost with defensive robustness in transformer LLMs

🛡️ Threat Analysis

Input Manipulation Attack

TextFooler and BERTAttack are adversarial example attacks that craft word-substituted inputs to cause misclassification at inference time — the core ML01 threat of evasion attacks. The attack goal is accuracy degradation (BERT-Base drops from 48% to 3%), not jailbreaking or prompt injection, making this an adversarial evasion study on NLP classifiers.
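As a rough illustration of how such an evasion attack is typically run, the sketch below uses the open-source TextAttack library's TextFooler recipe against a Hugging Face sequence classifier. The paper's exact task, dataset, checkpoints, and attack parameters are not given on this page, so the checkpoint and dataset names here are placeholder assumptions, not the authors' setup.

```python
# Minimal sketch of an ML01-style evasion attack using the TextAttack library.
# The checkpoint and dataset below are assumed placeholders, not the paper's setup.
import transformers
from textattack import Attacker, AttackArgs
from textattack.attack_recipes import TextFoolerJin2019
from textattack.datasets import HuggingFaceDataset
from textattack.models.wrappers import HuggingFaceModelWrapper

# Victim classifier: any sequence-classification checkpoint can be wrapped here.
model_name = "textattack/bert-base-uncased-SST-2"  # assumed example checkpoint
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
victim = HuggingFaceModelWrapper(model, tokenizer)

# TextFooler recipe: word-importance ranking plus synonym substitution,
# querying the model as a black box at inference time (no gradient access).
attack = TextFoolerJin2019.build(victim)

# Evaluation data: each example is perturbed until the predicted label flips
# or the attack's constraints are exhausted.
dataset = HuggingFaceDataset("glue", "sst2", split="validation")

attacker = Attacker(attack, dataset, AttackArgs(num_examples=100))
results = attacker.attack_dataset()  # logs per-example outcomes and a summary table
```

Swapping the victim checkpoint (e.g., a RoBERTa-Base or Flan-T5 classifier) while keeping the recipe fixed yields the kind of cross-model robustness comparison described above; BERTAttack can be substituted via its corresponding recipe in the same library.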


Details

Domains
nlp
Model Types
transformer, llm
Threat Tags
black_box, inference_time, untargeted, digital
Applications
text classification