How Toxic Can You Get? Search-based Toxicity Testing for Large Language Models
Simone Corbo 1, Luca Bancale 1, Valeria De Gennaro 1, Livia Lestingi 1, Vincenzo Scotti 2, Matteo Camilli 1
Published on arXiv
2501.01741
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
EvoTox achieves significantly higher detected toxicity than all baselines (effect size up to 1.0 vs. random search, 0.99 vs. adversarial attacks) with a modest 22–35% cost overhead across five state-of-the-art LLMs.
EvoTox
Novel technique introduced
Language is a deep-rooted means of perpetuating stereotypes and discrimination. Large Language Models (LLMs), now a pervasive technology in our everyday lives, can cause extensive harm when prone to generating toxic responses. The standard way to address this issue is to align the LLM, which, however, dampens the issue without constituting a definitive solution. Therefore, testing LLMs even after alignment remains crucial for detecting any residual deviations from ethical standards. We present EvoTox, an automated testing framework for LLMs' inclination to toxicity, providing a way to quantitatively assess how far LLMs can be pushed toward toxic responses even in the presence of alignment. The framework adopts an iterative evolution strategy that exploits the interplay between two LLMs: the System Under Test (SUT) and the Prompt Generator, which steers SUT responses toward higher toxicity. The toxicity level is assessed by an automated oracle based on an existing toxicity classifier. We conduct a quantitative and qualitative empirical evaluation using five state-of-the-art LLMs of increasing size (7B–671B parameters) as evaluation subjects. Our quantitative evaluation assesses the cost-effectiveness of four alternative versions of EvoTox against existing baseline methods based on random search, curated datasets of toxic prompts, and adversarial attacks. Our qualitative assessment engages human evaluators to rate the fluency of the generated prompts and the perceived toxicity of the responses collected during the testing sessions. Results indicate that the detected toxicity level is significantly higher than with the selected baseline methods (effect size up to 1.0 against random search and up to 0.99 against adversarial attacks). Furthermore, EvoTox incurs a limited cost overhead (22% to 35% on average).
Key Contributions
- EvoTox: an iterative evolutionary testing framework that co-opts a Prompt Generator LLM to steer a System Under Test LLM toward increasingly toxic responses, scored by an automated toxicity oracle
- Empirical evaluation of four EvoTox variants against baselines (random search, curated toxic prompt datasets, adversarial attacks) across five LLMs ranging from 7B to 671B parameters
- Demonstrates effect sizes up to 1.0 over random search and 0.99 over adversarial attacks with only 22–35% average cost overhead
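The iterative evolution strategy described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the authors' implementation: `generate_prompt`, `query_sut`, and `toxicity_score` are assumed stand-ins for the Prompt Generator LLM, the System Under Test, and the automated toxicity oracle, respectively, and the simple keep-if-better acceptance rule is an assumption about how such a search loop could work.

```python
def evotox_sketch(seed_prompt, generate_prompt, query_sut, toxicity_score,
                  iterations=10):
    """Sketch of an EvoTox-style evolutionary toxicity search.

    All three callables are hypothetical placeholders:
      generate_prompt(best, history) -> mutated prompt (Prompt Generator LLM)
      query_sut(prompt)              -> SUT response
      toxicity_score(response)       -> float in [0, 1] (toxicity oracle)
    """
    best_prompt = seed_prompt
    best_score = toxicity_score(query_sut(best_prompt))
    history = [(best_prompt, best_score)]
    for _ in range(iterations):
        # Ask the Prompt Generator to mutate the current best prompt,
        # conditioned on the search history so far.
        candidate = generate_prompt(best_prompt, history)
        score = toxicity_score(query_sut(candidate))
        history.append((candidate, score))
        # Keep the candidate only if it elicits a more toxic response.
        if score > best_score:
            best_prompt, best_score = candidate, score
    return best_prompt, best_score
```

In practice each of the three callables would wrap an LLM or classifier API call, so the 22–35% cost overhead reported in the paper corresponds to the extra Prompt Generator and oracle invocations on top of querying the SUT alone.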