defense 2025

DINA: A Dual Defense Framework Against Internal Noise and External Attacks in Natural Language Processing

Ko-Wei Chuang 1, Hen-Hsen Huang 2, Tsai-Yen Li 1



Published on arXiv (arXiv:2508.05671)

Input Manipulation Attack

OWASP ML Top 10 — ML01

Data Poisoning Attack

OWASP ML Top 10 — ML02

Key Finding

DINA significantly improves NLP model robustness and accuracy over baselines when facing both adversarial evasion attacks and label poisoning simultaneously.

DINA

Novel technique introduced


As large language models (LLMs) and generative AI become increasingly integrated into customer service and moderation applications, adversarial threats emerge from both external manipulations and internal label corruption. In this work, we identify and systematically address these dual adversarial threats by introducing DINA (Dual Defense Against Internal Noise and Adversarial Attacks), a novel unified framework tailored specifically for NLP. Our approach adapts advanced noisy-label learning methods from computer vision and integrates them with adversarial training to simultaneously mitigate internal label sabotage and external adversarial perturbations. Extensive experiments conducted on a real-world dataset from an online gaming service demonstrate that DINA significantly improves model robustness and accuracy compared to baseline models. Our findings not only highlight the critical necessity of dual-threat defenses but also offer practical strategies for safeguarding NLP systems in realistic adversarial scenarios, underscoring broader implications for fair and responsible AI deployment.


Key Contributions

  • Identifies and formalizes the dual adversarial threat in NLP: simultaneous external evasion attacks and internal label corruption by malicious annotators
  • Adapts noisy-label learning techniques from computer vision to NLP and integrates them with adversarial training into the unified DINA framework
  • Demonstrates on a real-world online gaming content moderation dataset that DINA outperforms baseline models on both robustness and accuracy
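The composition of the two defenses can be sketched as a single training step: filter likely-mislabeled samples by small loss, then adversarially augment the retained subset before updating. This is a minimal illustrative sketch whose structure is inferred from the summary; the toy model, its loss, and the numeric "perturbation" are stand-ins, not the paper's implementation.

```python
class ToyModel:
    """Stand-in 1-D regressor; only here to make the sketch runnable."""
    def __init__(self):
        self.w = 0.0
    def loss(self, x, y):
        return abs(self.w * x - y)
    def perturb(self, x):
        return x + 0.1  # stand-in for a text-space adversarial perturbation
    def update(self, xs, ys):
        # Closed-form least-squares step in place of gradient descent.
        self.w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def dina_step(model, batch, keep_ratio=0.7):
    """One training step combining both defenses; returns the kept indices."""
    xs, ys = batch
    # 1) Internal defense: keep the keep_ratio fraction with smallest loss
    #    (likely-mislabeled samples tend to incur larger loss).
    k = int(len(xs) * keep_ratio)
    clean = sorted(range(len(xs)), key=lambda i: model.loss(xs[i], ys[i]))[:k]
    # 2) External defense: adversarial counterparts of the retained samples.
    xs_adv = [model.perturb(xs[i]) for i in clean]
    ys_clean = [ys[i] for i in clean]
    # 3) Update on clean originals plus their adversarial variants.
    model.update([xs[i] for i in clean] + xs_adv, ys_clean + ys_clean)
    return clean

model = ToyModel()
model.w = 1.0  # pretend the model is warmed up on the true relation y = x
# Last label is corrupted (the true value would be 4.0):
clean = dina_step(model, ([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 0.0]),
                  keep_ratio=0.75)
# The corrupted sample (index 3) incurs the largest loss and is excluded.
```

The ordering matters: adversarial examples are generated only for samples judged clean, so the external defense is not trained against poisoned labels.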

🛡️ Threat Analysis

Input Manipulation Attack

The external threat is character-level adversarial perturbation (e.g., replacing Chinese characters with visually similar ones) crafted to evade spam classifiers at inference time; adversarial training is a core component of the DINA defense.
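The mechanism of such an attack can be illustrated with a homoglyph substitution against a naive keyword blocklist. The table and filter below are illustrative assumptions (the paper targets Chinese characters, and its classifier is a trained model rather than a blocklist), but the evasion principle is the same: the text looks unchanged to a human while no longer matching what the defender expects.

```python
# Map characters to visually similar Unicode lookalikes (illustrative table).
HOMOGLYPHS = {
    "a": "\u0430",  # Latin 'a' -> Cyrillic 'а'
    "e": "\u0435",  # Latin 'e' -> Cyrillic 'е'
    "o": "\u043e",  # Latin 'o' -> Cyrillic 'о'
}

def perturb(text: str) -> str:
    """Replace each character with a visually similar homoglyph if one exists."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

def naive_filter(text: str, banned=("free gold",)) -> bool:
    """A keyword blocklist that the attacker wants to evade."""
    return any(term in text for term in banned)

msg = "free gold"
blocked_before = naive_filter(msg)           # True: original is caught
blocked_after = naive_filter(perturb(msg))   # False: lookalikes evade the match
```

Adversarial training counters this by including such perturbed variants, with their original labels, in the training set.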

Data Poisoning Attack

The internal threat is label poisoning by malicious annotators who corrupt fine-tuning corpora; DINA adapts noisy-label learning methods to detect and mitigate this training-time data poisoning.
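One standard noisy-label-learning heuristic of the kind adapted here is small-loss sample selection (popularized by co-teaching-style methods): once a model has partially fit the data, mislabeled samples tend to incur unusually high loss, so the low-loss fraction of each batch is treated as clean. The sketch below uses synthetic loss values; it is not the paper's exact selection rule.

```python
def select_clean(losses, keep_ratio):
    """Return indices of the keep_ratio fraction of samples with smallest loss,
    ordered from smallest to largest loss."""
    k = int(len(losses) * keep_ratio)
    return sorted(range(len(losses)), key=lambda i: losses[i])[:k]

# Synthetic per-sample cross-entropy losses for a mini-batch: samples 2 and 4
# stand out as likely mislabeled.
losses = [0.1, 0.2, 2.5, 0.15, 3.0, 0.3]
clean = select_clean(losses, keep_ratio=0.7)  # keeps 4 of 6 samples
# Indices 2 and 4 are excluded from (or down-weighted in) the update.
```

In practice the keep ratio is often annealed over training, since early in training the model has not yet separated clean from noisy samples by loss.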


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, inference_time, black_box, targeted, digital
Datasets
proprietary online gaming service dataset
Applications
content moderation, online gaming chat filtering, customer service nlp