
AdversariaLLM: A Unified and Modular Toolbox for LLM Robustness Research

Tim Beyer 1,2, Jonas Dornbusch 1,2, Jakob Steimle 1,2, Moritz Ladenburger 1,2, Leo Schwinn 1,2, Stephan Günnemann 1,2

Published on arXiv (2511.04316) · 2 citations · 51 references

Input Manipulation Attack (OWASP ML Top 10 — ML01)

Prompt Injection (OWASP LLM Top 10 — LLM01)

Key Finding

Correctness fixes to tokenization filtering and chat templates alone yield up to 28% improvement in attack success rate compared to existing buggy implementations.
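The tokenization-filtering fix can be illustrated with a round-trip check, as used by suffix-optimization attacks to discard candidates whose token ids do not survive decode-then-encode. The toy greedy tokenizer below is hypothetical and stands in for a real BPE tokenizer; it is not AdversariaLLM's implementation.

```python
# Toy longest-match tokenizer: "ab" is a merged token, so [0, 1] ("a"+"b")
# re-encodes to [2] and must be filtered out by a correct implementation.
VOCAB = {"ab": 2, "a": 0, "b": 1}
ID_TO_TOK = {i: t for t, i in VOCAB.items()}

def encode(text):
    ids, i = [], 0
    while i < len(text):
        # greedy longest match, mimicking how BPE-style tokenizers tend to behave
        for tok in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(tok, i):
                ids.append(VOCAB[tok])
                i += len(tok)
                break
    return ids

def decode(ids):
    return "".join(ID_TO_TOK[i] for i in ids)

def filter_candidates(candidates):
    # keep only candidate id sequences that re-tokenize to themselves
    return [ids for ids in candidates if encode(decode(ids)) == ids]

kept = filter_candidates([[0, 1], [2]])  # [0, 1] is dropped, [2] survives
```

Skipping this check lets an attack optimize token sequences the victim model would never actually see after text round-tripping, which is one way buggy implementations under-report attack success.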

AdversariaLLM

Novel technique introduced


The rapid expansion of research on Large Language Model (LLM) safety and robustness has produced a fragmented and often buggy ecosystem of implementations, datasets, and evaluation methods. This fragmentation makes reproducibility and comparability across studies challenging, hindering meaningful progress. To address these issues, we introduce AdversariaLLM, a toolbox for conducting LLM jailbreak robustness research. Its design centers on reproducibility, correctness, and extensibility. The framework implements twelve adversarial attack algorithms, integrates seven benchmark datasets spanning harmfulness, over-refusal, and utility evaluation, and provides access to a wide range of open-weight LLMs via Hugging Face. The implementation includes advanced features for comparability and reproducibility such as compute-resource tracking, deterministic results, and distributional evaluation techniques. AdversariaLLM also integrates judging through the companion package JudgeZoo, which can also be used independently. Together, these components aim to establish a robust foundation for transparent, comparable, and reproducible research in LLM safety.


Key Contributions

  • Identifies and fixes critical bugs in tokenization filtering, chat templates, and batched generation in existing tools, achieving up to 28% ASR improvement from correctness fixes alone.
  • Unified framework implementing 12 adversarial attack algorithms (discrete, continuous, hybrid), 7 benchmark datasets covering harmfulness and over-refusal, and 13 automated judges via companion package JudgeZoo.
  • Advanced reproducibility features including compute-resource tracking, deterministic results, Monte Carlo distributional evaluation, and 2.12× more consistent batched generation.
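The Monte Carlo distributional evaluation mentioned above amounts to estimating attack success rate (ASR) over many sampled generations per prompt rather than from a single greedy decode. A minimal sketch, where the per-prompt success probabilities are illustrative stand-ins for judge verdicts on real sampled outputs:

```python
import random

def monte_carlo_asr(success_probs, samples_per_prompt, rng):
    """Estimate ASR by drawing `samples_per_prompt` generations per prompt
    and averaging the judged-harmful fraction across prompts."""
    per_prompt = [
        sum(rng.random() < p for _ in range(samples_per_prompt)) / samples_per_prompt
        for p in success_probs
    ]
    return sum(per_prompt) / len(per_prompt)

rng = random.Random(0)
# hypothetical per-prompt jailbreak probabilities under temperature sampling
probs = [0.05, 0.40, 0.90, 0.10, 0.60]
estimate = monte_carlo_asr(probs, samples_per_prompt=500, rng=rng)
```

The point of the distributional view is that a prompt which jailbreaks the model 40% of the time under sampling would be scored 0 or 1 by a single-sample evaluation, depending on luck; averaging over samples makes ASR comparisons stable across runs.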

🛡️ Threat Analysis

Input Manipulation Attack

The toolbox implements gradient-guided discrete attacks (e.g., GCG-style adversarial suffix optimization) alongside continuous and hybrid approaches, covering the token-level perturbation attack paradigm for LLMs.
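A model-free sketch of the GCG-style greedy coordinate step: use the gradient of the attack loss with respect to the one-hot token choice to shortlist promising substitutions, then keep the swap that actually lowers the loss. Here a random linear "loss" and random embeddings stand in for a real LLM; all names are illustrative.

```python
import random

random.seed(0)
V, D, L = 20, 8, 5                           # toy vocab size, embed dim, suffix length
E = [[random.gauss(0, 1) for _ in range(D)] for _ in range(V)]  # stand-in embeddings
w = [random.gauss(0, 1) for _ in range(D)]   # stand-in target direction

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def loss(ids):
    # toy surrogate for the attack loss a real LLM forward pass would supply
    return -sum(dot(E[i], w) for i in ids)

# gradient of the loss w.r.t. the one-hot token choice; for this linear toy
# loss it is identical at every suffix position
grad = [-dot(E[t], w) for t in range(V)]

suffix = [random.randrange(V) for _ in range(L)]
start_loss = loss(suffix)
for _ in range(50):
    pos = random.randrange(L)                            # coordinate to update
    top_k = sorted(range(V), key=lambda t: grad[t])[:4]  # most loss-decreasing swaps
    best = min(top_k, key=lambda t: loss(suffix[:pos] + [t] + suffix[pos + 1:]))
    if loss(suffix[:pos] + [best] + suffix[pos + 1:]) < loss(suffix):
        suffix[pos] = best                               # greedy accept-if-better
```

In the real attack, the gradient is taken through the model's embedding layer and differs per position; combining this step with the round-trip tokenization filter above is where the correctness fixes matter.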


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, black_box, inference_time
Datasets
HarmBench, JailbreakBench, StrongREJECT, XSTest, OR-Bench
Applications
llm safety evaluation, jailbreak robustness research, adversarial nlp benchmarking