
From Insight to Exploit: Leveraging LLM Collaboration for Adaptive Adversarial Text Generation

Najrin Sultana 1, Md Rafi Ur Rashid 1, Kang Gu 2, Shagufta Mehnaz 1

0 citations · 49 references · EMNLP

Published on arXiv · 2511.03128

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

LLM-guided adversarial perturbations outperform static baselines (PromptAttack, CombinedAttack) and transfer effectively to unseen LLMs, demonstrating systematic robustness weaknesses in zero-shot LLM classifiers.

StaDec / DyDec (Static Deceptor / Dynamic Deceptor)

Novel technique introduced


LLMs can achieve strong zero-shot performance on diverse tasks from a simple task prompt, eliminating the need for training or fine-tuning. However, when applying these models to sensitive tasks, it is crucial to thoroughly assess their robustness against adversarial inputs. In this work, we introduce Static Deceptor (StaDec) and Dynamic Deceptor (DyDec), two attack frameworks designed to systematically generate dynamic and adaptive adversarial examples by leveraging the LLMs' own language understanding. We produce subtle, natural-looking adversarial inputs that preserve semantic similarity to the original text while effectively deceiving the target LLM. By using an automated, LLM-driven pipeline, we eliminate the dependence on external heuristics. Our attacks evolve with advancements in LLMs and demonstrate strong transferability to models unknown to the attacker. Overall, this work provides a systematic approach for the self-assessment of an LLM's robustness. We release our code and data at https://github.com/Shukti042/AdversarialExample.


Key Contributions

  • Static Deceptor (StaDec) and Dynamic Deceptor (DyDec): two fully automated, LLM-driven attack pipelines that generate adaptive adversarial text examples without external heuristics or gradient access
  • Adversarial examples that preserve semantic similarity to original text while consistently deceiving state-of-the-art LLM classifiers (GPT-4o, Llama-3-70B) across four sensitive tasks
  • Demonstrated strong cross-model transferability and evaluation of three existing defenses against the proposed attacks
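The pipeline the contributions describe (an attacker LLM proposes perturbations, a similarity filter keeps the candidate close to the original, and the loop stops once the target classifier is deceived) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `attacker_llm`, `target_classifier`, and `semantic_similarity` are stand-in mocks (the paper uses real LLM calls and embedding-based similarity), and all names and thresholds here are hypothetical.

```python
from typing import Optional

def attacker_llm(text: str, round_num: int) -> str:
    """Mock of the attacker LLM that rewrites the input to evade detection.
    Stand-in perturbation: soften one trigger word per round."""
    substitutions = [("free", "complimentary"), ("winner", "selected user")]
    word, repl = substitutions[round_num % len(substitutions)]
    return text.replace(word, repl)

def target_classifier(text: str) -> str:
    """Mock zero-shot LLM classifier (spam detection as the example task)."""
    triggers = {"free", "winner"}
    return "spam" if any(t in text.lower().split() for t in triggers) else "ham"

def semantic_similarity(a: str, b: str) -> float:
    """Crude word-overlap (Jaccard) proxy; the paper uses semantic similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def generate_adversarial(original: str, sim_threshold: float = 0.5,
                         max_rounds: int = 5) -> Optional[str]:
    """Iteratively perturb until the target classifier's label flips,
    keeping only candidates that stay above the similarity threshold."""
    original_label = target_classifier(original)
    candidate = original
    for r in range(max_rounds):
        candidate = attacker_llm(candidate, r)
        if semantic_similarity(original, candidate) < sim_threshold:
            continue  # perturbation drifted too far semantically; keep iterating
        if target_classifier(candidate) != original_label:
            return candidate  # label flipped while semantics were preserved
    return None  # attack failed within the round budget

adv = generate_adversarial("You are a winner claim your free prize now")
print(adv)
```

In the actual frameworks the perturbation step is driven by an LLM's understanding of why the classifier made its decision (rather than fixed substitutions), which is what makes the attack adaptive and free of external heuristics.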

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box · inference_time · targeted
Applications
spam detection · hate speech detection · text classification