AdversariaLLM: A Unified and Modular Toolbox for LLM Robustness Research
Tim Beyer 1,2, Jonas Dornbusch 1,2, Jakob Steimle 1,2, Moritz Ladenburger 1,2, Leo Schwinn 1,2, Stephan Günnemann 1,2
Published on arXiv (arXiv:2511.04316)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Correctness fixes to tokenization filtering and chat templates alone yield up to a 28% improvement in attack success rate compared to existing buggy implementations.
AdversariaLLM
Novel technique introduced
The rapid expansion of research on Large Language Model (LLM) safety and robustness has produced a fragmented and often buggy ecosystem of implementations, datasets, and evaluation methods. This fragmentation makes reproducibility and comparability across studies challenging, hindering meaningful progress. To address these issues, we introduce AdversariaLLM, a toolbox for conducting LLM jailbreak robustness research. Its design centers on reproducibility, correctness, and extensibility. The framework implements twelve adversarial attack algorithms, integrates seven benchmark datasets spanning harmfulness, over-refusal, and utility evaluation, and provides access to a wide range of open-weight LLMs via Hugging Face. The implementation includes advanced features for comparability and reproducibility such as compute-resource tracking, deterministic results, and distributional evaluation techniques. AdversariaLLM also integrates judging through the companion package JudgeZoo, which can also be used independently. Together, these components aim to establish a robust foundation for transparent, comparable, and reproducible research in LLM safety.
Key Contributions
- Identifies and fixes critical bugs in tokenization filtering, chat templates, and batched generation in existing tools, achieving up to 28% ASR improvement from correctness fixes alone.
- Unified framework implementing 12 adversarial attack algorithms (discrete, continuous, hybrid), 7 benchmark datasets covering harmfulness and over-refusal, and 13 automated judges via companion package JudgeZoo.
- Advanced reproducibility features including compute-resource tracking, deterministic results, Monte Carlo distributional evaluation, and 2.12× more consistent batched generation.
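To make the tokenization-filtering bug class concrete, here is a minimal, self-contained sketch of the kind of consistency check a correct attack implementation needs. Token-level edits can produce ID sequences that no input string maps back to, so the attack ends up scoring a prompt the model will never actually receive. The names (`ToyTokenizer`, `filter_candidates`) and the greedy longest-match tokenizer are illustrative assumptions, not the toolbox's actual API.

```python
class ToyTokenizer:
    """A minimal greedy longest-match tokenizer over a fixed vocabulary."""

    def __init__(self, vocab):
        # Map token string -> id; longer tokens are preferred during encoding.
        self.vocab = {tok: i for i, tok in enumerate(vocab)}
        self.inv = {i: tok for tok, i in self.vocab.items()}
        self._by_length = sorted(self.vocab, key=len, reverse=True)

    def encode(self, text):
        ids, pos = [], 0
        while pos < len(text):
            for tok in self._by_length:
                if text.startswith(tok, pos):
                    ids.append(self.vocab[tok])
                    pos += len(tok)
                    break
            else:
                raise ValueError(f"untokenizable text at {pos}: {text[pos:]!r}")
        return ids

    def decode(self, ids):
        return "".join(self.inv[i] for i in ids)


def filter_candidates(tokenizer, candidate_ids):
    """Keep only candidates whose decoded string re-encodes to the same IDs.

    Sequences that fail this round-trip correspond to strings the tokenizer
    would never produce, so their loss is measured on a phantom prompt.
    """
    return [
        ids for ids in candidate_ids
        if tokenizer.encode(tokenizer.decode(ids)) == ids
    ]


# With vocab ["ab", "a", "b"], the candidate [1, 2] ("a" + "b") decodes to
# "ab", which re-encodes to [0] -- so it is filtered out, while [0] survives.
tok = ToyTokenizer(["ab", "a", "b"])
print(filter_candidates(tok, [[1, 2], [0]]))  # -> [[0]]
```

Real implementations apply the same round-trip test with the target model's tokenizer (e.g., a Hugging Face `encode`/`decode` pair); skipping it is one source of the inflated or deflated attack-success numbers the paper reports.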
🛡️ Threat Analysis
The toolbox explicitly implements gradient-guided discrete attacks (e.g., GCG-style adversarial suffix optimization) alongside continuous and hybrid approaches, covering the token-level perturbation attack paradigm for LLMs.
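A greedy coordinate gradient step can be sketched in a few lines. In a real attack the substitution scores come from differentiating the LLM's loss through its embedding layer; here a fixed random matrix `W` stands in for the network, so the gradient with respect to a one-hot token encoding at each position is simply `W[pos]`. All names are illustrative assumptions, not the toolbox's API.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, SUFFIX_LEN = 50, 8

# Stand-in "model": the loss of a suffix is the sum of one score per
# (position, token) pair. Real attacks use the LLM's target-sequence loss.
W = rng.normal(size=(SUFFIX_LEN, VOCAB))


def loss(suffix):
    return float(sum(W[pos, tok] for pos, tok in enumerate(suffix)))


def gcg_step(suffix, top_k=4):
    """One greedy coordinate step: at every position, try the top_k
    substitutions with the most negative gradient score, then keep the
    single swap that lowers the loss the most."""
    best, best_loss = list(suffix), loss(suffix)
    for pos in range(len(suffix)):
        for tok in np.argsort(W[pos])[:top_k]:  # most negative scores first
            cand = list(suffix)
            cand[pos] = int(tok)
            cand_loss = loss(cand)
            if cand_loss < best_loss:
                best, best_loss = cand, cand_loss
    return best, best_loss


suffix = [0] * SUFFIX_LEN
suffix, final_loss = gcg_step(suffix)
print(final_loss <= loss([0] * SUFFIX_LEN))  # the step never increases the loss
```

The published GCG algorithm samples a batch of candidate swaps from the top-k gradient coordinates rather than sweeping them exhaustively, but the core loop (score substitutions by gradient, evaluate, keep the best) is the same.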