attack 2025

ToxSearch: Evolving Prompts for Toxicity Search in Large Language Models

Onkar Shelar, Travis Desell

Rochester Institute of Technology

0 citations · 22 references · arXiv

Published on arXiv: 2511.12487

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Lexical substitutions achieve the best yield-variance trade-off among operators, and elite prompts evolved on LLaMA 3.1 8B transfer cross-model with toxicity roughly halving on most targets.

ToxSearch

Novel technique introduced

Large Language Models remain vulnerable to adversarial prompts that elicit toxic content even after safety alignment. We present ToxSearch, a black-box evolutionary framework that tests model safety by evolving prompts in a synchronous steady-state loop. The system employs a diverse set of operators, including lexical substitutions, negation, back-translation, paraphrasing, and two semantic crossover operators, while a moderation oracle provides fitness guidance. Operator-level analysis shows heterogeneous behavior: lexical substitutions offer the best yield-variance trade-off, semantic-similarity crossover acts as a precise low-throughput inserter, and global rewrites exhibit high variance with elevated refusal costs. Using elite prompts evolved on LLaMA 3.1 8B, we observe practically meaningful but attenuated cross-model transfer, with toxicity roughly halving on most targets, smaller LLaMA 3.2 variants showing the strongest resistance, and some cross-architecture models retaining higher toxicity. These results suggest that small, controllable perturbations are effective vehicles for systematic red-teaming and that defenses should anticipate cross-model reuse of adversarial prompts rather than focusing only on single-model hardening.
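The loop described in the abstract can be illustrated with a minimal steady-state sketch. Everything here is an assumption made for illustration: the operator functions (`swap_word`, `crossover`), the vocabulary, and `toy_oracle` are hypothetical stand-ins, and the replacement rule (replace the worst member only when the child scores higher) is a generic steady-state scheme rather than the paper's exact procedure. In the real system the fitness signal would come from a moderation oracle scoring a target model's response, not from the prompt text itself.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical vocabulary and target set for a harmless toy objective.
VOCAB = ["red", "green", "blue", "apple", "tree", "weather", "today"]
TARGET = {"red", "green", "blue"}

Operator = Callable[[str, str, random.Random], str]


def toy_oracle(prompt: str) -> float:
    """Stand-in for a moderation oracle: fraction of words in TARGET."""
    words = prompt.split()
    return sum(w in TARGET for w in words) / len(words) if words else 0.0


def swap_word(parent: str, _other: str, rng: random.Random) -> str:
    """Toy lexical substitution: replace one randomly chosen word."""
    words = parent.split()
    words[rng.randrange(len(words))] = rng.choice(VOCAB)
    return " ".join(words)


def crossover(parent: str, other: str, rng: random.Random) -> str:
    """Toy one-point crossover on word sequences."""
    a, b = parent.split(), other.split()
    if min(len(a), len(b)) < 2:
        return parent  # too short to cut; return the parent unchanged
    cut = rng.randrange(1, min(len(a), len(b)))
    return " ".join(a[:cut] + b[cut:])


def steady_state_search(
    population: List[str],
    operators: List[Operator],
    oracle: Callable[[str], float],
    steps: int,
    seed: int = 0,
) -> List[Tuple[float, str]]:
    """Create one child per step; it replaces the worst member if it scores higher."""
    rng = random.Random(seed)
    scored = [(oracle(p), p) for p in population]
    for _ in range(steps):
        parent = max(rng.sample(scored, 2))[1]  # binary tournament selection
        other = rng.choice(scored)[1]
        child = rng.choice(operators)(parent, other, rng)
        child_score = oracle(child)
        worst = min(range(len(scored)), key=lambda i: scored[i][0])
        if child_score > scored[worst][0]:
            scored[worst] = (child_score, child)
    return sorted(scored, reverse=True)


init = [
    "please describe the weather today",
    "tell me about the red apple",
    "what is the green tree like",
]
result = steady_state_search(init, [swap_word, crossover], toy_oracle, steps=200, seed=1)
```

Because a child only ever displaces the current worst member, the population size stays fixed and the best score never decreases, which is the property that distinguishes steady-state replacement from generational replacement.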

Key Contributions

ToxSearch: a synchronous steady-state evolutionary framework that evolves adversarial prompts using diverse operators (lexical substitution, negation, back-translation, paraphrasing, semantic crossover) guided by a moderation oracle
Operator-level analysis showing that lexical substitutions offer the best yield-variance trade-off, while global rewrites exhibit high variance and elevated refusal costs
Cross-model transfer study showing that elite prompts evolved on LLaMA 3.1 8B retain attenuated but meaningful toxicity on other models, with smaller LLaMA 3.2 variants showing the strongest resistance
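An operator-level comparison of the kind listed above could be computed along these lines. This is a sketch under stated assumptions: the record layout `(operator, fitness_delta, refused)` and the function name `summarize_operators` are hypothetical, and the numbers in `records` are invented placeholders, not results from the paper.

```python
from collections import defaultdict
from statistics import mean, pvariance
from typing import Dict, List, Tuple

# (operator name, change in fitness from parent to child, whether the model refused)
Record = Tuple[str, float, bool]


def summarize_operators(records: List[Record]) -> Dict[str, Dict[str, float]]:
    """Per-operator mean gain, variance of gain, and refusal rate."""
    by_op: Dict[str, List[Tuple[float, bool]]] = defaultdict(list)
    for op, delta, refused in records:
        by_op[op].append((delta, refused))
    summary = {}
    for op, rows in by_op.items():
        deltas = [d for d, _ in rows]
        summary[op] = {
            "n": len(rows),
            "mean_delta": mean(deltas),
            "var_delta": pvariance(deltas),  # population variance of the gains
            "refusal_rate": sum(r for _, r in rows) / len(rows),
        }
    return summary


# Invented placeholder records, for illustration only.
records = [
    ("lexical_substitution", 0.10, False),
    ("lexical_substitution", 0.12, False),
    ("paraphrase", 0.40, True),
    ("paraphrase", -0.20, False),
]
summary = summarize_operators(records)
```

Reporting variance and refusal rate alongside the mean gain is what makes a "yield-variance trade-off" visible: an operator with a high average gain can still be a poor choice if its outcomes are erratic or frequently refused.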

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm · transformer

Threat Tags

black_box · inference_time

Datasets

LLaMA 3.1 8B (evolution target) · LLaMA 3.2 variants (transfer targets)

Applications

llm safety testing · red-teaming · toxicity elicitation

Similar Papers

Jailbreaking in the Haystack

Exposing Long-Tail Safety Failures in Large Language Models through Efficient Diverse Response Sampling

Special-Character Adversarial Attacks on Open-Source Language Model

Chain-of-Thought Hijacking

SearchAttack: Red-Teaming LLMs against Knowledge-to-Action Threats under Online Web Search

Is Reasoning Capability Enough for Safety in Long-Context Language Models?

Structured Semantic Cloaking for Jailbreak Attacks on Large Language Models

From Adversarial Poetry to Adversarial Tales: An Interpretability Research Agenda