Defense · 2025

Breaking the Adversarial Robustness-Performance Trade-off in Text Classification via Manifold Purification

Chenhao Dang 1,2, Jing Ma 3

0 citations · 45 references · arXiv


Published on arXiv: 2511.07888

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

MC²F achieves state-of-the-art adversarial robustness across three datasets and multiple attack methods while fully preserving — and in some cases improving — clean data accuracy, breaking the robustness-performance trade-off

MC²F (Manifold-Correcting Causal Flow)

Novel technique introduced


A persistent challenge in text classification (TC) is that enhancing model robustness against adversarial attacks typically degrades performance on clean data. We argue that this challenge can be resolved by modeling the distribution of clean samples in the encoder embedding manifold. To this end, we propose the Manifold-Correcting Causal Flow (MC²F), a two-module system that operates directly on sentence embeddings. A Stratified Riemannian Continuous Normalizing Flow (SR-CNF) learns the density of the clean data manifold and identifies out-of-distribution embeddings, which are then corrected by a Geodesic Purification Solver. This solver projects adversarial points back onto the learned manifold via the shortest path, restoring a clean, semantically coherent representation. We conducted extensive evaluations across three TC datasets and multiple adversarial attacks. The results demonstrate that MC²F not only establishes a new state of the art in adversarial robustness but also fully preserves performance on clean data, even yielding modest gains in accuracy.
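The detection stage hinges on scoring how likely an embedding is under the clean-data density. As a minimal sketch of that idea, the toy below fits a diagonal Gaussian to "clean" embeddings and flags low-likelihood points; this is an illustrative stand-in only, since the paper learns the density with an SR-CNF rather than a Gaussian, and all names here (`fit_gaussian`, `log_density`, the threshold value) are hypothetical.

```python
# Toy stand-in for density-based OOD detection: fit a diagonal Gaussian to
# clean embeddings, then flag embeddings whose log-likelihood falls below a
# threshold as likely adversarial.
import math

def fit_gaussian(clean):
    """Per-dimension mean and variance of a list of clean embedding vectors."""
    n, d = len(clean), len(clean[0])
    mean = [sum(v[i] for v in clean) / n for i in range(d)]
    var = [sum((v[i] - mean[i]) ** 2 for v in clean) / n + 1e-6 for i in range(d)]
    return mean, var

def log_density(x, mean, var):
    """Diagonal-Gaussian log-likelihood of a single embedding x."""
    return -0.5 * sum(
        math.log(2 * math.pi * var[i]) + (x[i] - mean[i]) ** 2 / var[i]
        for i in range(len(x))
    )

def is_out_of_distribution(x, mean, var, threshold):
    """Flag x as adversarial if it is too unlikely under the clean density."""
    return log_density(x, mean, var) < threshold

# Clean embeddings cluster near the origin; a far-away point is flagged.
clean = [[0.1, 0.0], [-0.1, 0.1], [0.0, -0.1], [0.1, 0.1]]
mean, var = fit_gaussian(clean)
print(is_out_of_distribution([5.0, 5.0], mean, var, threshold=-10.0))   # True
print(is_out_of_distribution([0.05, 0.0], mean, var, threshold=-10.0))  # False
```

The SR-CNF plays the same role as `log_density` here, but models a curved, stratified manifold rather than a single Gaussian cluster.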


Key Contributions

  • Empirical demonstration that clean and adversarial text embeddings are geometrically separable in PLM representation spaces, reframing adversarial defense as a geometric purification problem
  • Stratified Riemannian Continuous Normalizing Flow (SR-CNF) that models the clean data manifold density and identifies out-of-distribution adversarial embeddings
  • Geodesic Purification Solver that projects flagged adversarial embeddings back onto the learned clean manifold via shortest-path (geodesic) optimization, resolving the robustness-accuracy trade-off
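The purification step above can be pictured with a toy manifold where geodesic projection has a closed form: if the clean manifold were the unit sphere, the shortest-path projection of an off-manifold embedding is radial normalization. The sketch below illustrates only that projection intuition; the paper's Geodesic Purification Solver optimizes geodesics on a learned manifold, and `purify_to_unit_sphere` is a hypothetical name for this analogy.

```python
# Toy geodesic purification: project an off-manifold "adversarial" embedding
# onto the unit sphere, its closest point along the radial direction.
import math

def purify_to_unit_sphere(x, eps=1e-12):
    """Return the closest point to x on the unit sphere (radial projection)."""
    norm = math.sqrt(sum(c * c for c in x))
    if norm < eps:
        raise ValueError("the origin has no unique projection onto the sphere")
    return [c / norm for c in x]

adv = [3.0, 4.0]                  # off-manifold point (norm 5)
clean = purify_to_unit_sphere(adv)
print(clean)                      # [0.6, 0.8]
```

On a learned manifold the projection has no closed form, so the actual solver must search for the shortest path numerically; the principle of snapping a flagged embedding back to the nearest on-manifold point is the same.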

🛡️ Threat Analysis

Input Manipulation Attack

Defends against adversarial evasion attacks (TextFooler, BERT-Attack, BAE, TextBugger) on text classifiers by detecting OOD embeddings and purifying them back onto the clean data manifold — a direct input purification defense against inference-time adversarial examples in NLP.


Details

Domains
nlp
Model Types
transformer
Threat Tags
inference_time, digital, black_box, white_box
Applications
text classification