attack · arXiv · Dec 9, 2025
Sampriti Soor, Suklav Ghosh, Arijit Sur · Indian Institute of Technology Guwahati
Gradient-optimized universal adversarial token suffixes degrade LLM classifiers across tasks and model families via Gumbel-Softmax relaxation
Input Manipulation Attack · Prompt Injection · nlp
Language models (LMs) are often used as zero-shot or few-shot classifiers by scoring label words, but they remain fragile to adversarial prompts. Prior work typically optimizes task- or model-specific triggers, making results difficult to compare and limiting transferability. We study universal adversarial suffixes: short token sequences (4-10 tokens) that, when appended to any input, broadly reduce accuracy across tasks and models. Our approach learns the suffix in a differentiable "soft" form using Gumbel-Softmax relaxation and then discretizes it for inference. Training maximizes calibrated cross-entropy on the label region while masking gold tokens to prevent trivial leakage, with entropy regularization to avoid collapse. A single suffix trained on one model transfers effectively to others, consistently lowering both accuracy and calibrated confidence. Experiments on sentiment analysis, natural language inference, paraphrase detection, commonsense QA, and physical reasoning with Qwen2-1.5B, Phi-1.5, and TinyLlama-1.1B demonstrate consistent attack effectiveness and transfer across tasks and model families.
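The core optimization loop can be sketched in a few lines. The snippet below is a minimal NumPy illustration, not the paper's implementation: a toy mean-pooled linear classifier stands in for the frozen LM, the suffix is a learnable logit matrix relaxed into soft one-hot token mixtures via Gumbel-Softmax, and gradient ascent on the gold label's cross-entropy plays the role of the attack objective. Calibration, gold-token masking, and entropy regularization are omitted for brevity; all sizes and the gold label are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, SUFFIX_LEN, DIM, N_CLASSES = 40, 6, 16, 2  # toy sizes (assumed)
TAU, LR, STEPS, GOLD = 0.5, 0.5, 300, 1           # GOLD label is arbitrary

# Toy frozen "classifier": embed tokens, mean-pool, linear label head.
E = rng.normal(size=(VOCAB, DIM))       # embedding table
W = rng.normal(size=(DIM, N_CLASSES))   # label head

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def gold_ce(logits):
    """Gold-label cross-entropy for the deterministic (noise-free) soft suffix."""
    pooled = (softmax(logits) @ E).mean(axis=0)
    return -np.log(softmax(pooled @ W)[GOLD])

suffix_logits = np.zeros((SUFFIX_LEN, VOCAB))  # learnable soft suffix
ce_before = gold_ce(suffix_logits)

for _ in range(STEPS):
    # Gumbel-Softmax: a differentiable, softened one-hot sample per position.
    g = -np.log(-np.log(rng.uniform(size=suffix_logits.shape) + 1e-10) + 1e-10)
    p = softmax((suffix_logits + g) / TAU)       # (SUFFIX_LEN, VOCAB)
    pooled = (p @ E).mean(axis=0)                # mixture of token embeddings
    q = softmax(pooled @ W)
    # Backprop by hand: maximize CE of the gold label (degrade the classifier).
    d_scores = q - np.eye(N_CLASSES)[GOLD]                       # dCE/dscores
    d_p = np.tile(W @ d_scores / SUFFIX_LEN, (SUFFIX_LEN, 1)) @ E.T
    d_z = p * (d_p - (d_p * p).sum(axis=-1, keepdims=True))      # softmax backward
    suffix_logits += LR * d_z / TAU              # gradient *ascent* on the loss

ce_after = gold_ce(suffix_logits)
hard_suffix = suffix_logits.argmax(axis=-1)  # discretize for inference
```

Annealing `TAU` toward 0 makes the soft samples approach hard one-hot vectors, so the final `argmax` discretization stays close to what was optimized.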
llm · transformer
attack · arXiv · Dec 9, 2025
Sampriti Soor, Suklav Ghosh, Arijit Sur · Indian Institute of Technology Guwahati
RL-trained adversarial suffixes degrade LLM classification accuracy using PPO and calibrated cross-entropy, outperforming gradient-based triggers in transferability
Input Manipulation Attack · nlp
Language models are vulnerable to short adversarial suffixes that can reliably alter their predictions. Prior work typically finds such suffixes with gradient search or rule-based methods, which are brittle and often tied to a single task or model. This paper casts suffix generation as reinforcement learning: the suffix is treated as a policy and trained with Proximal Policy Optimization (PPO) against a frozen model that serves as a reward oracle. Rewards are shaped with calibrated cross-entropy, removing label bias and aggregating over surface forms to improve transferability. The method is evaluated on five diverse NLP benchmark datasets, covering sentiment, natural language inference, paraphrase detection, and commonsense reasoning, using three distinct language models: Qwen2-1.5B Instruct, TinyLlama-1.1B Chat, and Phi-1.5. Results show that RL-trained suffixes consistently degrade accuracy and transfer more effectively across tasks and models than comparable prior adversarial triggers.
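As a rough illustration of the suffix-as-policy view, the sketch below uses plain REINFORCE with an exponential-moving-average baseline as a deliberately simplified stand-in for PPO, and a toy mean-pooled linear classifier as the frozen reward oracle. The paper's calibrated, surface-form-aggregated reward is reduced here to the raw gold-label cross-entropy; all names, sizes, and the gold label are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
VOCAB, SUFFIX_LEN, DIM, N_CLASSES = 20, 5, 8, 2  # toy sizes (assumed)
GOLD, LR, STEPS = 0, 0.2, 800                    # GOLD label is arbitrary

# Toy frozen "reward oracle": embed suffix tokens, mean-pool, linear head.
E = rng.normal(size=(VOCAB, DIM))
W = rng.normal(size=(DIM, N_CLASSES))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def reward(tokens):
    """Gold-label cross-entropy under the frozen model: higher = worse model."""
    q = softmax(E[tokens].mean(axis=0) @ W)
    return -np.log(q[GOLD])

# The policy is one categorical distribution over the vocabulary per position.
policy_logits = np.zeros((SUFFIX_LEN, VOCAB))
baseline = 0.0  # EMA baseline in place of PPO's clipped-ratio machinery

for _ in range(STEPS):
    p = softmax(policy_logits)
    tokens = np.array([rng.choice(VOCAB, p=p[i]) for i in range(SUFFIX_LEN)])
    r = reward(tokens)
    adv = r - baseline
    baseline = 0.9 * baseline + 0.1 * r
    # REINFORCE: grad of log pi(token_i) w.r.t. logits is onehot(token_i) - p_i.
    grad = -p
    grad[np.arange(SUFFIX_LEN), tokens] += 1.0
    policy_logits += LR * adv * grad  # ascend the expected reward

greedy_suffix = policy_logits.argmax(axis=-1)  # most likely suffix after training
greedy_reward = reward(greedy_suffix)
```

Because the policy emits discrete tokens directly, no relaxation or discretization step is needed; the trade-off is the higher variance of the policy-gradient estimate, which PPO's clipping and value baseline are designed to tame.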
llm · transformer