AdversariaLLM: A Unified and Modular Toolbox for LLM Robustness Research
Tim Beyer 1,2, Jonas Dornbusch 1,2, Jakob Steimle 1,2, Moritz Ladenburger 1,2, Leo Schwinn 1,2, Stephan Günnemann 1,2
Published on arXiv (arXiv:2511.04316)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Correctness fixes to tokenization filtering and chat templates alone yield up to a 28% improvement in attack success rate compared to existing buggy implementations.
AdversariaLLM
Novel technique introduced
The rapid expansion of research on Large Language Model (LLM) safety and robustness has produced a fragmented and often buggy ecosystem of implementations, datasets, and evaluation methods. This fragmentation makes reproducibility and comparability across studies challenging, hindering meaningful progress. To address these issues, we introduce AdversariaLLM, a toolbox for conducting LLM jailbreak robustness research. Its design centers on reproducibility, correctness, and extensibility. The framework implements twelve adversarial attack algorithms, integrates seven benchmark datasets spanning harmfulness, over-refusal, and utility evaluation, and provides access to a wide range of open-weight LLMs via Hugging Face. The implementation includes advanced features for comparability and reproducibility such as compute-resource tracking, deterministic results, and distributional evaluation techniques. AdversariaLLM also integrates judging through the companion package JudgeZoo, which can also be used independently. Together, these components aim to establish a robust foundation for transparent, comparable, and reproducible research in LLM safety.
Key Contributions
- Identifies and fixes critical bugs in tokenization filtering, chat templates, and batched generation in existing tools, achieving up to 28% ASR improvement from correctness fixes alone.
- Unified framework implementing 12 adversarial attack algorithms (discrete, continuous, hybrid), 7 benchmark datasets covering harmfulness and over-refusal, and 13 automated judges via companion package JudgeZoo.
- Advanced reproducibility features including compute-resource tracking, deterministic results, Monte Carlo distributional evaluation, and 2.12× more consistent batched generation.
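To make the tokenization-filtering bug class concrete, here is a minimal, self-contained sketch of the kind of consistency check a correct attack implementation needs. Token-level edits can produce ID sequences that no input string maps back to, so the attack ends up scoring a prompt the model will never actually receive. The names (`ToyTokenizer`, `filter_candidates`) and the greedy longest-match tokenizer are illustrative assumptions, not the toolbox's actual API.

```python
class ToyTokenizer:
    """A minimal greedy longest-match tokenizer over a fixed vocabulary."""

    def __init__(self, vocab):
        # Map token string -> id; longer tokens are preferred during encoding.
        self.vocab = {tok: i for i, tok in enumerate(vocab)}
        self.inv = {i: tok for tok, i in self.vocab.items()}
        self._by_length = sorted(self.vocab, key=len, reverse=True)

    def encode(self, text):
        ids, pos = [], 0
        while pos < len(text):
            for tok in self._by_length:
                if text.startswith(tok, pos):
                    ids.append(self.vocab[tok])
                    pos += len(tok)
                    break
            else:
                raise ValueError(f"untokenizable text at {pos}: {text[pos:]!r}")
        return ids

    def decode(self, ids):
        return "".join(self.inv[i] for i in ids)


def filter_candidates(tokenizer, candidate_ids):
    """Keep only candidates whose decoded string re-encodes to the same IDs.

    Sequences that fail this round-trip correspond to strings the tokenizer
    would never produce, so their loss is measured on a phantom prompt.
    """
    return [
        ids for ids in candidate_ids
        if tokenizer.encode(tokenizer.decode(ids)) == ids
    ]


# With vocab ["ab", "a", "b"], the candidate [1, 2] ("a" + "b") decodes to
# "ab", which re-encodes to [0] -- so it is filtered out, while [0] survives.
tok = ToyTokenizer(["ab", "a", "b"])
print(filter_candidates(tok, [[1, 2], [0]]))  # -> [[0]]
```

Real implementations apply the same round-trip test with the target model's tokenizer (e.g., a Hugging Face `encode`/`decode` pair); skipping it is one source of the inflated or deflated attack-success numbers the paper reports.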
🛡️ Threat Analysis
The toolbox explicitly implements gradient-guided discrete attacks (e.g., GCG-style adversarial suffix optimization) alongside continuous and hybrid approaches, covering the token-level perturbation attack paradigm for LLMs.
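A greedy coordinate gradient step can be sketched in a few lines. In a real attack the substitution scores come from differentiating the LLM's loss through its embedding layer; here a fixed random matrix `W` stands in for the network, so the gradient with respect to a one-hot token encoding at each position is simply `W[pos]`. All names are illustrative assumptions, not the toolbox's API.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, SUFFIX_LEN = 50, 8

# Stand-in "model": the loss of a suffix is the sum of one score per
# (position, token) pair. Real attacks use the LLM's target-sequence loss.
W = rng.normal(size=(SUFFIX_LEN, VOCAB))


def loss(suffix):
    return float(sum(W[pos, tok] for pos, tok in enumerate(suffix)))


def gcg_step(suffix, top_k=4):
    """One greedy coordinate step: at every position, try the top_k
    substitutions with the most negative gradient score, then keep the
    single swap that lowers the loss the most."""
    best, best_loss = list(suffix), loss(suffix)
    for pos in range(len(suffix)):
        for tok in np.argsort(W[pos])[:top_k]:  # most negative scores first
            cand = list(suffix)
            cand[pos] = int(tok)
            cand_loss = loss(cand)
            if cand_loss < best_loss:
                best, best_loss = cand, cand_loss
    return best, best_loss


suffix = [0] * SUFFIX_LEN
suffix, final_loss = gcg_step(suffix)
print(final_loss <= loss([0] * SUFFIX_LEN))  # the step never increases the loss
```

The published GCG algorithm samples a batch of candidate swaps from the top-k gradient coordinates rather than sweeping them exhaustively, but the core loop (score substitutions by gradient, evaluate, keep the best) is the same.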