HarmTransform: Transforming Explicit Harmful Queries into Stealthy Ones via Multi-Agent Debate
Published on arXiv: 2512.23717
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
HarmTransform's multi-agent debate framework significantly outperforms standard baselines in producing stealthy harmful query transformations that evade LLM safety mechanisms while preserving malicious intent.
HarmTransform
Novel technique introduced
Large language models (LLMs) are equipped with safety mechanisms to detect and block harmful queries, yet current alignment approaches primarily target overtly dangerous content and overlook more subtle threats. Users can disguise harmful intent through covert rephrasing that preserves malicious objectives while appearing benign, creating a significant gap in existing safety training data. To address this limitation, we introduce HarmTransform, a multi-agent debate framework for systematically transforming harmful queries into stealthier forms while preserving their underlying harmful intent. The framework leverages iterative critique and refinement among multiple agents to generate high-quality, covert harmful query transformations that can be used to improve future LLM safety alignment. Experiments demonstrate that HarmTransform significantly outperforms standard baselines in producing effective query transformations. At the same time, our analysis reveals that debate acts as a double-edged sword: it can sharpen transformations and improve stealth, but it may also introduce topic shifts and unnecessary complexity. These insights highlight both the promise and the limitations of multi-agent debate for generating comprehensive safety training data.
Key Contributions
- HarmTransform: the first multi-agent debate framework specifically designed to transform explicit harmful queries into stealthier forms while preserving their underlying malicious intent
- Comprehensive evaluation protocol and in-depth analysis of debate dynamics, identifying when debate improves stealth and when it causes regressions (topic shift, information overload)
- Empirical demonstration that multi-agent debate significantly outperforms standard single-agent baselines for harmful query stealth transformation