RedHerring Attack: Testing the Reliability of Attack Detection
Published on arXiv (arXiv:2509.20691)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
RedHerring drops adversarial attack detection accuracy by 20–71 percentage points across tested detectors while maintaining or improving the underlying classifier's accuracy, demonstrating a new second-order evasion threat against NLP defense pipelines.
RedHerring
Novel technique introduced
In response to adversarial text attacks, attack detection models have been proposed and shown to successfully identify adversarially modified text. These detectors can provide an additional check for NLP models and flag suspicious inputs for human review. However, their reliability has not yet been thoroughly explored. We therefore propose and test a novel attack setting and attack, RedHerring. RedHerring aims to make attack detection models unreliable by modifying a text so that the detection model predicts an attack while the classifier remains correct. This creates a tension between the classifier and the detector: if a human sees the detector give an "incorrect" prediction while the classifier gives a correct one, the human will come to see the detector as unreliable. We test this novel threat model on 4 datasets against 3 detectors defending 4 classifiers. We find that RedHerring drops detection accuracy by 20–71 percentage points while maintaining (or improving) classifier accuracy. As an initial defense, we propose a simple confidence check that requires no retraining of the classifier or detector and greatly increases detection accuracy. This novel threat model offers new insights into how adversaries may target detection models.
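The core objective can be sketched as a constrained search: maximize the detector's confidence that the text is an attack, subject to the classifier staying correct. The sketch below is illustrative only; the paper's actual search procedure, perturbation space, and models are not reproduced here, and `classify`, `detect_attack`, and `perturb` are hypothetical stand-ins.

```python
# Hedged sketch of the RedHerring objective as a greedy search.
# Assumptions (not from the paper): `classify` and `detect_attack` are
# callables returning (label, confidence) tuples, where the detector's
# label 1 means "attack"; `perturb` proposes one modified candidate text.

def red_herring_score(text, true_label, classify, detect_attack):
    """Score a candidate: reward tricking the detector into flagging an
    attack while hard-constraining the classifier to remain correct."""
    cls_label, _ = classify(text)
    det_label, det_conf = detect_attack(text)
    if cls_label != true_label:
        return float("-inf")  # violates the "classifier stays correct" constraint
    # Higher is better: detector confidently predicts "attack" on this text.
    return det_conf if det_label == 1 else -det_conf


def red_herring_attack(text, true_label, perturb, classify, detect_attack, steps=50):
    """Greedily apply perturbations that raise the RedHerring score."""
    best = text
    best_score = red_herring_score(best, true_label, classify, detect_attack)
    for _ in range(steps):
        cand = perturb(best)
        score = red_herring_score(cand, true_label, classify, detect_attack)
        if score > best_score:
            best, best_score = cand, score
    return best
```

In a real setting the perturbation step would be a semantics-preserving text transformation (e.g. synonym substitution), so that the modified input still reads naturally to the human reviewer.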
Key Contributions
- Novel 'RedHerring' threat model that targets adversarial text attack detectors by inducing false positives, reducing detector accuracy by 20–71 points while preserving classifier accuracy
- Empirical evaluation across 4 datasets, 3 detectors, and 4 classifiers demonstrating the attack's effectiveness in undermining human trust in detection models
- Simple confidence-check defense requiring no retraining that substantially recovers detection accuracy against the RedHerring attack
🛡️ Threat Analysis
RedHerring crafts modified text inputs at inference time to cause incorrect outputs from adversarial attack detection models — this is a second-order input manipulation attack that exploits the detection layer rather than the primary classifier. The attack causes misclassification (false positives) in the detector by manipulating the input text, which is the core of ML01.
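The proposed confidence-check defense exploits the attack's own constraint: RedHerring inputs leave the classifier correct and typically confident, so a highly confident classifier prediction paired with an attack flag is suspicious. One plausible instantiation is sketched below; the paper's exact rule and threshold are not reproduced, and `classify`, `detect_attack`, and the threshold `tau` are assumptions.

```python
# Hedged sketch of a confidence-check wrapper around the detector.
# Assumptions (not from the paper): `classify` and `detect_attack` return
# (label, confidence) tuples, the detector's label 1 means "attack", and
# `tau` is an illustrative confidence threshold.

def checked_detect(text, classify, detect_attack, tau=0.9):
    """Suppress the detector's attack flag when the classifier is highly
    confident, treating the flag as a likely RedHerring-style false alarm."""
    det_label, _ = detect_attack(text)
    _, cls_conf = classify(text)
    if det_label == 1 and cls_conf >= tau:
        return 0  # override: confident classifier suggests no genuine attack
    return det_label
```

This wrapper requires no retraining of either model, matching the paper's description of the defense as a simple post-hoc check.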