attack arXiv Nov 3, 2025
Sampriti Soor, Alik Pramanick, Jothiprakash K et al. · Indian Institute of Technology Guwahati · Kalinga Institute of Industrial Technology
GAN + CLIP-guided black-box adversarial attack on multilabel classifiers using saliency and text-embedding loss
Input Manipulation Attack vision multimodal
The rapid growth of deep learning has brought about powerful models that can handle various tasks, such as identifying images and understanding language. However, adversarial attacks, which introduce imperceptible alterations to the input, can deceive these models into making inaccurate predictions. In this paper, a generative adversarial attack method is proposed that uses the CLIP model to create highly effective and visually imperceptible adversarial perturbations. The CLIP model's ability to align text and image representations is used to incorporate natural-language semantics into a guided loss, generating effective adversarial examples that look identical to the original inputs. This integration allows extensive scene manipulation, creating perturbations in multi-object environments specifically designed to deceive multilabel classifiers. Our approach combines the concentrated perturbation strategy of the Saliency-based Auto-Encoder (SSAE) with the dissimilar-text-embedding loss of Generative Adversarial Multi-Object Scene Attacks (GAMA), resulting in perturbations that both deceive classification models and maintain high structural similarity to the original images. The method was tested on various tasks across diverse black-box victim models. The experimental results show that our method performs competitively, achieving comparable or superior results to existing techniques while preserving greater visual fidelity.
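The abstract's combined objective, a fooling term that pushes the adversarial image's CLIP embedding away from the label text embeddings plus a fidelity term that concentrates perturbation in salient regions, can be sketched in a few lines. This is a minimal NumPy illustration under stated assumptions: `embed_fn`, `sal_mask`, and the weights `w_fool`/`w_fid` are hypothetical stand-ins, not the paper's actual components.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def attack_loss(img, adv, sal_mask, embed_fn, label_text_embeds,
                w_fool=1.0, w_fid=10.0):
    """Toy combined objective: push the adversarial image's embedding away
    from the ground-truth label text embeddings (CLIP-style fooling term)
    while penalizing perturbation energy outside salient regions (fidelity)."""
    emb = embed_fn(adv)
    # Fooling term: mean similarity to the label texts (the attacker minimizes it).
    fool = float(np.mean([cosine(emb, t) for t in label_text_embeds]))
    # Fidelity term: perturbation outside the saliency mask is penalized,
    # concentrating changes where they are least visible yet most effective.
    fid = float(np.linalg.norm((adv - img) * (1.0 - sal_mask)))
    return w_fool * fool + w_fid * fid
```

In the real method the embedding function is CLIP's image encoder and the mask comes from a saliency model; here any vector-valued `embed_fn` (for example, a flatten) demonstrates the shape of the loss.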
gan transformer
attack arXiv Dec 9, 2025
Sampriti Soor, Suklav Ghosh, Arijit Sur · Indian Institute of Technology Guwahati
Gradient-optimized universal adversarial token suffixes degrade LLM classifiers across tasks and model families via Gumbel-Softmax relaxation
Input Manipulation Attack Prompt Injection nlp
Language models (LMs) are often used as zero-shot or few-shot classifiers by scoring label words, but they remain fragile to adversarial prompts. Prior work typically optimizes task- or model-specific triggers, making results difficult to compare and limiting transferability. We study universal adversarial suffixes: short token sequences (4-10 tokens) that, when appended to any input, broadly reduce accuracy across tasks and models. Our approach learns the suffix in a differentiable "soft" form using Gumbel-Softmax relaxation and then discretizes it for inference. Training maximizes calibrated cross-entropy on the label region while masking gold tokens to prevent trivial leakage, with entropy regularization to avoid collapse. A single suffix trained on one model transfers effectively to others, consistently lowering both accuracy and calibrated confidence. Experiments on sentiment analysis, natural language inference, paraphrase detection, commonsense QA, and physical reasoning with Qwen2-1.5B, Phi-1.5, and TinyLlama-1.1B demonstrate consistent attack effectiveness and transfer across tasks and model families.
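The "soft" suffix described above can be sketched as follows: each suffix position holds a learnable logit vector over the vocabulary, a Gumbel-Softmax sample turns it into a relaxed one-hot distribution, the distribution mixes token embeddings for differentiable training, and an argmax discretizes it for inference. This is a minimal NumPy sketch; the vocabulary size, embedding matrix, and temperature are illustrative, not the paper's settings.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Sample a relaxed one-hot row per suffix position (Gumbel-Softmax)."""
    rng = np.random.default_rng() if rng is None else rng
    gumbel = -np.log(-np.log(rng.uniform(1e-9, 1.0, size=logits.shape)))
    y = (logits + gumbel) / tau
    y -= y.max(axis=-1, keepdims=True)          # numerical stability
    probs = np.exp(y)
    return probs / probs.sum(axis=-1, keepdims=True)

def soft_suffix(logits, embedding_matrix, tau=1.0, rng=None):
    """Differentiable suffix: probability-weighted mixture of token embeddings."""
    probs = gumbel_softmax(logits, tau, rng)
    return probs @ embedding_matrix, probs

def discretize(probs):
    """Hard suffix for inference: argmax token id at each position."""
    return probs.argmax(axis=-1)
```

Lower temperatures `tau` make the relaxed samples closer to one-hot, shrinking the gap between the soft suffix used in training and the discrete tokens used at inference.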
llm transformer
attack arXiv Dec 9, 2025
Sampriti Soor, Suklav Ghosh, Arijit Sur · Indian Institute of Technology Guwahati
RL-trained adversarial suffixes degrade LLM classification accuracy using PPO and calibrated cross-entropy, outperforming gradient-based triggers in transferability
Input Manipulation Attack nlp
Language models are vulnerable to short adversarial suffixes that can reliably alter predictions. Prior work usually finds such suffixes with gradient search or rule-based methods, but these are brittle and often tied to a single task or model. In this paper, a reinforcement learning framework is used in which the suffix is treated as a policy and trained with Proximal Policy Optimization (PPO) against a frozen victim model serving as a reward oracle. Rewards are shaped using calibrated cross-entropy, removing label bias and aggregating over surface forms to improve transferability. The proposed method is evaluated on five diverse NLP benchmark datasets, covering sentiment, natural language inference, paraphrase, and commonsense reasoning, using three distinct language models: Qwen2-1.5B Instruct, TinyLlama-1.1B Chat, and Phi-1.5. Results show that RL-trained suffixes consistently degrade accuracy and transfer more effectively across tasks and models than prior adversarial triggers of a similar kind.
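The reward shaping described above can be sketched as: aggregate the log-probabilities of each label's surface forms (e.g. "yes"/"Yes") with logsumexp, subtract the same quantity computed on a content-free prompt to remove label bias, renormalize, and return the cross-entropy on the gold label, which the attacker's policy maximizes. This is a minimal NumPy sketch under stated assumptions; the calibration prompt and surface-form sets are illustrative, not the paper's exact choices.

```python
import numpy as np

def logsumexp(v):
    """Numerically stable log(sum(exp(v)))."""
    v = np.asarray(v, dtype=float)
    m = v.max()
    return m + np.log(np.exp(v - m).sum())

def suffix_reward(logprobs_by_label, prior_by_label, gold):
    """Reward for a candidate suffix: calibrated cross-entropy on the gold label.

    logprobs_by_label[i] holds the log-probabilities of label i's surface
    forms on the attacked input; prior_by_label[i] holds the same quantities
    on a content-free calibration prompt."""
    agg = np.array([logsumexp(v) for v in logprobs_by_label])
    prior = np.array([logsumexp(v) for v in prior_by_label])
    cal = agg - prior            # subtract content-free prior (debias labels)
    cal -= logsumexp(cal)        # renormalize to calibrated log-probabilities
    return -cal[gold]            # attacker's reward: gold-label cross-entropy
```

A suffix that leaves the gold label confidently predicted earns a small reward; one that pushes calibrated mass onto wrong labels earns a large one, which is exactly the signal PPO needs.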
llm transformer