defense 2026

Unifying Adversarial Robustness and Training Across Text Scoring Models

Manveer Singh Tamber, Hosna Oyarhoseini, Jimmy Lin

0 citations · 67 references · arXiv


Published on arXiv

2602.00857

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Combining complementary adversarial training methods yields strong cross-attack generalization for text scoring models, and adversarially trained reward models reduce reward hacking while producing better-aligned LLMs.

Content Injection Adversarial Training

Novel technique introduced


Research on adversarial robustness in language models is currently fragmented across applications and attacks, obscuring shared vulnerabilities. In this work, we propose unifying the study of adversarial robustness in text scoring models spanning dense retrievers, rerankers, and reward models. This motivates adapting both attacks and adversarial training methods across model roles. Unlike open-ended generation, text scoring failures are directly testable: an attack succeeds when an irrelevant or rejected text outscores a relevant or chosen one. Using this principled lens of text scoring, we demonstrate that current adversarial training formulations for language models are often short-sighted, failing to effectively generalize across attacks. To address this, we introduce multiple adversarial training methods for text scoring models and show that combining complementary training methods can yield strong robustness while also improving task effectiveness. We also highlight the practical value of our approach for RLHF, showing that our adversarially trained reward models mitigate reward hacking and support the training of better-aligned LLMs. We provide our code and models for further study.
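The abstract's failure condition is directly checkable in code. A minimal sketch, using a hypothetical lexical-overlap scorer as a stand-in for the neural scoring models studied in the paper (all function names here are illustrative, not the authors' API):

```python
def overlap_score(query: str, doc: str) -> float:
    """Toy stand-in scorer (hypothetical): fraction of query tokens
    that appear in the document. A real dense retriever, reranker,
    or reward model would compute this score with a neural network."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def attack_succeeded(score_fn, query: str, chosen: str, adversarial: str) -> bool:
    """The paper's testable failure condition: an attack succeeds when
    an irrelevant or rejected text outscores a relevant or chosen one."""
    return score_fn(query, adversarial) > score_fn(query, chosen)

# Content injection: appending query terms to an irrelevant document.
query = "best treatment for migraine"
chosen = "triptans are a common treatment for migraine attacks"
injected = "buy cheap watches online best treatment for migraine"
```

Here `attack_succeeded(overlap_score, query, chosen, injected)` returns `True`: the spam document, padded with query terms, outscores the genuinely relevant one, which is exactly the kind of content injection failure the paper targets.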


Key Contributions

  • Unifying adversarial robustness study across dense retrievers, rerankers, and reward models under the text scoring lens with crisp, testable failure conditions
  • Introduction of adversarial training against content injection attacks — a previously unaddressed threat in this setting — alongside PGD, HotFlip, and rudimentary adversarial training
  • Demonstration that combining complementary adversarial training methods improves both robustness and task effectiveness, with adversarially trained reward models mitigating reward hacking in RLHF
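One way to read the "combining complementary training methods" contribution is as a summed pairwise loss over clean and attacked negatives. This is a sketch under our own assumptions (the paper's exact objective may differ), with illustrative names throughout:

```python
def margin_loss(s_pos: float, s_neg: float, margin: float = 1.0) -> float:
    """Hinge-style pairwise loss: penalize whenever the negative's score
    comes within `margin` of the positive's."""
    return max(0.0, margin - (s_pos - s_neg))

def combined_adversarial_loss(score, query, pos, neg, attacks, margin=1.0):
    """Sketch of combined adversarial training (our assumption, not the
    authors' exact recipe): sum the clean pairwise loss with one pairwise
    loss per attack family, so the scorer must keep the positive above
    both natural and adversarially perturbed negatives."""
    s_pos = score(query, pos)
    loss = margin_loss(s_pos, score(query, neg), margin)
    for attack in attacks:  # e.g. content injection, token-flip attacks
        adv_neg = attack(query, neg)
        loss += margin_loss(s_pos, score(query, adv_neg), margin)
    return loss
```

The intuition matches the key finding: each attack family contributes its own loss term, so no single perturbation style dominates training, which is what enables cross-attack generalization.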

🛡️ Threat Analysis

Input Manipulation Attack

The core contributions are adversarial attacks (PGD, HotFlip, and GCG-style gradient-guided token manipulation) and defenses (adversarial training methods) targeting text scoring models at inference time: the classic input manipulation / evasion attack framework.
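The gradient-guided attacks named above (PGD, HotFlip, GCG) rank token substitutions by embedding-space gradients. As a gradient-free illustration of the same search structure, here is a greedy black-box variant; the function and its parameters are hypothetical, not the paper's implementation:

```python
import random

def greedy_token_attack(score, query, doc, vocab, n_iters=20, seed=0):
    """Greedy stand-in for gradient-guided token manipulation (real
    HotFlip/GCG use gradients to rank candidates; this black-box sketch
    keeps only the search loop): at each step, try swapping one token
    for each vocabulary candidate and keep the swap that most increases
    the score of the manipulated document."""
    rng = random.Random(seed)
    tokens = doc.split()
    for _ in range(n_iters):
        i = rng.randrange(len(tokens))
        best_tok, best_s = tokens[i], score(query, " ".join(tokens))
        for cand in vocab:
            trial = tokens[:i] + [cand] + tokens[i + 1:]
            s = score(query, " ".join(trial))
            if s > best_s:
                best_tok, best_s = cand, s
        tokens[i] = best_tok
    return " ".join(tokens)
```

Run against any scorer, the loop monotonically pushes an irrelevant document's score upward, which is the evasion objective; adversarial training counters this by making such high-scoring perturbations harder to find.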


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
white_box · black_box · inference_time · training_time
Applications
dense retrieval · document reranking · reward modeling · rlhf