defense 2026

RTD-Guard: A Black-Box Textual Adversarial Detection Framework via Replacement Token Detection

He Zhu 1,2, Yanshu Li 1,2, Wen Liu 1,2, Haitian Yang 1


Published on arXiv (2603.12582)

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

Significantly outperforms existing detection baselines while requiring only two black-box queries and no adversarial training data

RTD-Guard

Novel technique introduced


Textual adversarial attacks pose a serious security threat to Natural Language Processing (NLP) systems by introducing imperceptible perturbations that mislead deep learning models. While adversarial example detection offers a lightweight alternative to robust training, existing methods typically rely on prior knowledge of attacks, white-box access to the victim model, or numerous queries, which severely limits their practical deployment. This paper introduces RTD-Guard, a novel black-box framework for detecting textual adversarial examples. Our key insight is that word-substitution perturbations in adversarial attacks closely resemble the "replaced tokens" that a Replaced Token Detection (RTD) discriminator is pre-trained to identify. Leveraging this, RTD-Guard employs an off-the-shelf RTD discriminator (without fine-tuning) to localize suspicious tokens, masks them, and detects adversarial examples by observing the shift in the victim model's prediction confidence before and after the intervention. The entire process requires no adversarial data, no model tuning, and no internal model access, and uses only two black-box queries. Comprehensive experiments on multiple benchmark datasets demonstrate that RTD-Guard effectively detects adversarial texts generated by diverse state-of-the-art attack methods. It surpasses existing detection baselines across multiple metrics, offering a highly efficient, practical, and resource-light defense mechanism, particularly suited for real-world deployment in resource-constrained or privacy-sensitive environments.
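The two-query detection loop described in the abstract can be sketched in a few lines. Note this is a minimal illustration, not the paper's implementation: `rtd_suspicion_scores` and `victim_confidence` are hypothetical stubs standing in for an off-the-shelf RTD discriminator (e.g. ELECTRA's, used frozen) and the black-box victim model, and the thresholds are placeholder values.

```python
# Sketch of the RTD-Guard detection loop, assuming stub interfaces for the
# RTD discriminator and the black-box victim model (both are hypothetical
# stand-ins; the paper's actual components are pre-trained models).

def rtd_suspicion_scores(tokens):
    """Stub: per-token probability that the token was 'replaced'.
    In RTD-Guard this comes from a pre-trained RTD discriminator
    used as-is, without any fine-tuning."""
    suspicious = {"terribel", "gr8"}  # toy stand-in for discriminator output
    return [0.9 if t in suspicious else 0.1 for t in tokens]

def victim_confidence(tokens):
    """Stub: the black-box victim model's confidence in its predicted label.
    Each call counts as one query; RTD-Guard needs exactly two."""
    return 0.95 if "terribel" in tokens else 0.6

def rtd_guard_is_adversarial(text, suspicion_thresh=0.5, shift_thresh=0.2):
    tokens = text.split()
    # Query 1: victim confidence on the original input.
    conf_before = victim_confidence(tokens)
    # Localize suspicious tokens with the frozen RTD discriminator
    # (this step needs no access to the victim model at all).
    scores = rtd_suspicion_scores(tokens)
    masked = [("[MASK]" if s > suspicion_thresh else t)
              for t, s in zip(tokens, scores)]
    # Query 2: victim confidence after masking the suspicious tokens.
    conf_after = victim_confidence(masked)
    # Adversarial inputs show a large confidence drop after intervention,
    # since masking undoes the word-substitution perturbation.
    return (conf_before - conf_after) > shift_thresh
```

The design point the sketch highlights is that the RTD discriminator does all the localization work offline, so the victim model is touched only twice, once before and once after masking.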


Key Contributions

  • First to repurpose RTD pre-training task for adversarial detection without fine-tuning or adversarial training data
  • Training-free black-box detection requiring only two queries to victim model
  • Outperforms existing detection baselines across multiple metrics with minimal computational overhead

🛡️ Threat Analysis

Input Manipulation Attack

Defends against textual adversarial attacks (word-substitution perturbations) that cause misclassification at inference time in NLP models.


Details

Domains
NLP
Model Types
transformer
Threat Tags
black_box, inference_time, digital
Applications
text classification, NLP systems