
Universal Adversarial Suffixes Using Calibrated Gumbel-Softmax Relaxation

Sampriti Soor, Suklav Ghosh, Arijit Sur

0 citations · 19 references · arXiv


Published on arXiv · 2512.08123

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

A single adversarial suffix trained on one model (e.g., Qwen2-1.5B) transfers to Phi-1.5 and TinyLlama-1.1B, consistently reducing accuracy and calibrated confidence across five heterogeneous NLP tasks.

Calibrated Gumbel-Softmax Adversarial Suffix

Novel technique introduced


Language models (LMs) are often used as zero-shot or few-shot classifiers by scoring label words, but they remain fragile to adversarial prompts. Prior work typically optimizes task- or model-specific triggers, making results difficult to compare and limiting transferability. We study universal adversarial suffixes: short token sequences (4-10 tokens) that, when appended to any input, broadly reduce accuracy across tasks and models. Our approach learns the suffix in a differentiable "soft" form using Gumbel-Softmax relaxation and then discretizes it for inference. Training maximizes calibrated cross-entropy on the label region while masking gold tokens to prevent trivial leakage, with entropy regularization to avoid collapse. A single suffix trained on one model transfers effectively to others, consistently lowering both accuracy and calibrated confidence. Experiments on sentiment analysis, natural language inference, paraphrase detection, commonsense QA, and physical reasoning with Qwen2-1.5B, Phi-1.5, and TinyLlama-1.1B demonstrate consistent attack effectiveness and transfer across tasks and model families.
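The core trick in the abstract is to make the discrete suffix differentiable: each suffix position holds a trainable logit vector over the vocabulary, Gumbel noise plus a temperature-scaled softmax yields a "soft" token mixture for training, and an argmax recovers hard tokens for inference. A minimal NumPy sketch of that relaxation follows; the function names, shapes, and temperature are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilize before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gumbel_softmax_suffix(logits, tau=1.0, rng=None):
    """Relax a discrete suffix into soft per-position token distributions.

    logits: (suffix_len, vocab_size) trainable parameters (hypothetical
    shapes; the paper's exact parameterization may differ).
    tau: temperature; lower values push samples closer to one-hot.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(1e-9, 1.0, size=logits.shape)
    g = -np.log(-np.log(u))              # Gumbel(0, 1) noise
    return softmax((logits + g) / tau)   # differentiable "soft" tokens

def discretize(soft_suffix):
    """Argmax each position to recover hard token ids for inference."""
    return soft_suffix.argmax(axis=-1)

# Toy example: a 6-token suffix over a 50-word vocabulary.
logits = np.zeros((6, 50))
soft = gumbel_softmax_suffix(logits, tau=0.5, rng=np.random.default_rng(0))
hard = discretize(soft)
```

During training, the soft distributions would be multiplied into the model's embedding matrix so gradients flow back into `logits`; only the discretized 4-10 token suffix is appended at attack time.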


Key Contributions

  • Universal adversarial suffix optimization via Gumbel-Softmax continuous relaxation, enabling stable gradient-based training over discrete token sequences
  • Calibrated contrastive training objective that contrasts context-dependent and null-prompt predictions to counteract label-prior bias, with entropy regularization and forbid-masks
  • Multi-task training setup that produces a single suffix transferable across sentiment analysis, NLI, paraphrase detection, commonsense QA, and physical reasoning tasks on multiple model families
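The second contribution, the calibrated contrastive objective, can be sketched as follows: subtract the model's null-prompt (content-free) label scores from its in-context scores to cancel label-prior bias, mask out any label tokens that would leak the gold answer, then drive down the calibrated gold-label likelihood with an entropy regularizer against collapse. Everything below is an assumed form for illustration (function name, shapes, the choice to regularize the calibrated label distribution, and `ent_weight` are all hypothetical), not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def calibrated_attack_loss(ctx_logits, null_logits, gold, forbid_mask,
                           ent_weight=0.01):
    """Assumed sketch of the calibrated contrastive objective.

    ctx_logits:  (batch, n_labels) label scores for input + soft suffix
    null_logits: (batch, n_labels) label scores for a content-free prompt
    gold:        (batch,) gold label indices
    forbid_mask: (batch, n_labels) 1 where a label token would leak the gold
    Lower return value = stronger attack (gold likelihood suppressed).
    """
    cal = ctx_logits - null_logits                        # cancel label prior
    cal = np.where(forbid_mask.astype(bool), -1e9, cal)   # forbid leaked tokens
    p = softmax(cal)
    gold_logp = np.log(p[np.arange(len(gold)), gold] + 1e-12).mean()
    entropy = -(p * np.log(p + 1e-12)).sum(-1).mean()     # anti-collapse term
    return gold_logp - ent_weight * entropy

# Toy batch: two examples, three labels.
null = np.zeros((2, 3))
gold = np.array([0, 1])
forbid = np.zeros((2, 3))
clean = np.array([[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]])   # model favors gold
flipped = np.array([[0.0, 2.0, 0.0], [3.0, 0.0, 0.0]]) # suffix flipped scores
loss_clean = calibrated_attack_loss(clean, null, gold, forbid)
loss_flipped = calibrated_attack_loss(flipped, null, gold, forbid)
```

Minimizing this loss with respect to the soft suffix simultaneously lowers accuracy and the calibrated confidence reported in the paper, since both are computed from the prior-corrected label distribution.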

🛡️ Threat Analysis

Input Manipulation Attack

The paper optimizes adversarial token suffixes via gradient-based Gumbel-Softmax relaxation: token-level perturbations appended at inference time that cause misclassification, a canonical ML01 adversarial-suffix attack.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time, targeted, digital
Datasets
SST, NLI benchmarks, paraphrase detection benchmarks, commonsense QA benchmarks, physical reasoning benchmarks
Applications
text classification, sentiment analysis, natural language inference, paraphrase detection, commonsense QA