Closing the Distribution Gap in Adversarial Training for LLMs
Firstname1 Lastname1 1, Firstname2 Lastname2 1,2, Firstname3 Lastname3 2, Firstname4 Lastname4 3, Firstname5 Lastname5 1, Firstname6 Lastname6 3,1,2, Firstname7 Lastname7 2, Firstname8 Lastname8 3, Firstname8 Lastname8 1,2
Published on arXiv: 2602.15238
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
DAT achieves substantially higher adversarial robustness against in-distribution exploits (e.g., past-tense rewrites, translations into other languages) than previous adversarial training methods for LLMs.
DAT (Distributional Adversarial Training)
Novel technique introduced
Adversarial training for LLMs is one of the most promising methods to reliably improve robustness against adversaries. However, despite significant progress, models remain vulnerable to simple in-distribution exploits, such as rewriting prompts in the past tense or translating them into other languages. We argue that this persistent fragility stems from a fundamental limitation of current adversarial training algorithms: they minimize adversarial loss on their training set but inadequately cover the data distribution, leaving models vulnerable to seemingly simple attacks. To bridge this gap, we propose Distributional Adversarial Training (DAT). We leverage Diffusion LLMs to approximate the true joint distribution of prompts and responses, enabling the generation of diverse, high-likelihood samples that address generalization failures. By combining optimization over the data distribution provided by the diffusion model with continuous adversarial training, DAT achieves substantially higher adversarial robustness than previous methods.
Key Contributions
- Identifies that current adversarial training for LLMs fails due to inadequate coverage of the true data distribution, leaving models vulnerable to simple in-distribution prompt rewrites
- Proposes Distributional Adversarial Training (DAT), which leverages Diffusion LLMs to approximate the joint distribution of prompts and responses for diverse, high-likelihood sample generation
- Combines distribution-aware sampling with continuous adversarial training to achieve substantially higher robustness than prior adversarial training methods
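The combination described above can be illustrated with a toy sketch: sample fresh batches from a distribution model, run an inner loop of continuous (embedding-space) adversarial perturbation, and take an outer gradient step on the adversarial loss. Everything here is a stand-in assumption, not the paper's implementation: `sample_from_diffusion_lm` fakes the Diffusion-LLM sampler with draws from a synthetic linear task, the "model" is a linear scorer so gradients are analytic, and the inner loop is a generic PGD-style sign-gradient ascent.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8
W_TRUE = rng.normal(size=DIM)  # hidden ground truth for the toy task

def sample_from_diffusion_lm(n):
    """Stand-in for the Diffusion-LLM sampler: returns fresh, diverse
    (input embedding, target) pairs instead of memorized training points."""
    x = rng.normal(size=(n, DIM))
    y = x @ W_TRUE + 0.1 * rng.normal(size=n)
    return x, y

def loss(w, x, y):
    return 0.5 * np.mean((x @ w - y) ** 2)

def dat_step(w, x, y, eps=0.1, inner=3, lr=0.05):
    # Inner loop: continuous adversarial perturbation of the inputs
    # (sign-gradient ascent on the loss w.r.t. x, clipped to an eps-ball).
    delta = np.zeros_like(x)
    for _ in range(inner):
        resid = (x + delta) @ w - y
        grad_x = resid[:, None] * w[None, :] / len(y)  # d loss / d x
        delta = np.clip(delta + eps * np.sign(grad_x), -eps, eps)
    # Outer step: minimize the adversarial loss w.r.t. the model weights.
    x_adv = x + delta
    grad_w = x_adv.T @ (x_adv @ w - y) / len(y)        # d loss / d w
    return w - lr * grad_w

w = rng.normal(size=DIM)
for _ in range(200):
    x, y = sample_from_diffusion_lm(32)  # distributional coverage step
    w = dat_step(w, x, y)
```

The key difference from standard adversarial training, as the abstract frames it, is the sampling line: each step draws new high-likelihood samples from the approximated data distribution rather than re-perturbing a fixed training set, so the adversarial inner loop is applied across the distribution instead of only at the training points.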