tool 2025

RedTWIZ: Diverse LLM Red Teaming via Adaptive Attack Planning

Artur Horal , Daniel Pina , Henrique Paz , Iago Paulo , João Soares , Rafael Ferreira , Diogo Tavares , Diogo Glória-Silva , João Magalhães , David Semedo

NOVA University of Lisbon

1 citation · arXiv

Published on arXiv

2510.06994

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Multi-turn adversarial attack strategies from the RedTWIZ framework successfully elicit unsafe code and malicious outputs from state-of-the-art safety-aligned LLMs.

RedTWIZ

Novel technique introduced

This paper presents the vision, scientific contributions, and technical details of RedTWIZ, an adaptive and diverse multi-turn red teaming framework for auditing the robustness of Large Language Models (LLMs) in AI-assisted software development. Our work is driven by three major research streams: (1) robust and systematic assessment of LLM conversational jailbreaks; (2) a diverse generative multi-turn attack suite, supporting compositional, realistic, and goal-oriented conversational jailbreak strategies; and (3) a hierarchical attack planner, which adaptively plans, serializes, and triggers attacks tailored to each LLM's specific vulnerabilities. Together, these contributions form a unified framework that combines assessment, attack generation, and strategic planning to comprehensively evaluate and expose weaknesses in LLMs' robustness. We conduct an extensive evaluation to systematically assess and analyze the performance of the overall system and of each component. Experimental results demonstrate that our multi-turn adversarial attack strategies can successfully lead state-of-the-art LLMs to produce unsafe generations, highlighting the pressing need for more research into enhancing LLMs' robustness.

Key Contributions

Automated LLM jailbreak assessment module with LLM-based judges for fine-grained multi-turn conversation scoring without manual annotation
Diverse generative multi-turn attack suite covering compositional, goal-oriented, and adaptive conversational jailbreak strategies for code/cybersecurity contexts
Hierarchical adaptive attack planner that probes target LLMs, then dynamically serializes and schedules attacks tailored to each model's specific vulnerabilities
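
The probe-then-schedule idea in the third contribution can be illustrated with a toy example. This summary does not describe the planner's actual algorithm, so the following is only a minimal sketch under stated assumptions: a greedy rule that probes every strategy once and then always picks the strategy with the highest mean judge score. The names `Planner`, `toy_target`, `toy_judge`, and `make_strategy` are hypothetical and do not come from RedTWIZ, and the target and judge are inert stand-ins rather than real models.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical types (not from the paper): a target maps a prompt to a reply,
# a strategy runs one multi-turn conversation and returns the transcript,
# and a judge maps a transcript to a score in [0, 1].
Target = Callable[[str], str]
Strategy = Callable[[Target], List[str]]
Judge = Callable[[List[str]], float]


@dataclass
class Planner:
    strategies: Dict[str, Strategy]
    judge: Judge
    scores: Dict[str, List[float]] = field(default_factory=dict)

    def probe(self, target: Target) -> None:
        # Probing phase: try every strategy once and record the judge score.
        for name, run in self.strategies.items():
            self.scores.setdefault(name, []).append(self.judge(run(target)))

    def schedule(self) -> List[str]:
        # Serialize strategies by mean observed score, best first.
        def mean(name: str) -> float:
            observed = self.scores.get(name, [])
            return sum(observed) / len(observed) if observed else 0.0

        return sorted(self.strategies, key=mean, reverse=True)

    def run(self, target: Target, budget: int) -> List[str]:
        # Probe, then spend the budget greedily, re-ranking after each attempt.
        self.probe(target)
        log = []
        for _ in range(budget):
            name = self.schedule()[0]
            score = self.judge(self.strategies[name](target))
            self.scores[name].append(score)
            log.append(name)
        return log


def toy_target(prompt: str) -> str:
    # Stand-in for a model under audit: "complies" only with prompts tagged [B].
    return "COMPLIED" if "[B]" in prompt else "REFUSED"


def make_strategy(tag: str, turns: int = 2) -> Strategy:
    def run(target: Target) -> List[str]:
        transcript = []
        for i in range(turns):
            prompt = f"{tag} placeholder turn {i}"
            transcript += [prompt, target(prompt)]
        return transcript

    return run


def toy_judge(transcript: List[str]) -> float:
    # Fraction of target replies (odd positions) marked as compliant.
    replies = transcript[1::2]
    return sum(r == "COMPLIED" for r in replies) / len(replies)


planner = Planner({"A": make_strategy("[A]"), "B": make_strategy("[B]")}, toy_judge)
log = planner.run(toy_target, budget=3)
print(log, planner.schedule())
```

In this toy run the probe scores strategy B above strategy A, so the whole budget goes to B. A planner meant for real use would likely balance exploration against exploitation rather than act purely greedily; that choice is outside what this summary specifies.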

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm

Threat Tags

black_box, inference_time, targeted

Datasets

Amazon Nova AI Challenge

Applications

llm safety auditing, code generation llms, ai-assisted software development, cybersecurity chatbots

Similar Papers

RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents

Learning-Based Automated Adversarial Red-Teaming for Robustness Evaluation of Large Language Models

ASTRA: Autonomous Spatial-Temporal Red-teaming for AI Software Assistants

How Toxic Can You Get? Search-based Toxicity Testing for Large Language Models

Quantifying Document Impact in RAG-LLMs

NAAMSE: Framework for Evolutionary Security Evaluation of Agents

From Rookie to Expert: Manipulating LLMs for Automated Vulnerability Exploitation in Enterprise Software

When AIOps Become "AI Oops": Subverting LLM-driven IT Operations via Telemetry Manipulation