Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents
Nivya Talokar 1, Ayush K Tarun 2, Murari Mandal 3, Maksym Andriushchenko 4,5,6, Antoine Bosselut 2
Published on arXiv: 2602.16346
Prompt Injection
OWASP LLM Top 10 — LLM01
Excessive Agency
OWASP LLM Top 10 — LLM08
Key Finding
STING achieves substantially higher illicit-task completion than single-turn prompting and chat-oriented multi-turn baselines, while multilingual evaluations reveal that lower-resource languages do not consistently yield higher attack success rates.
STING (Sequential Testing of Illicit N-step Goal execution)
Novel technique introduced
LLM-based agents execute real-world workflows via tools and memory. The same affordances allow ill-intended adversaries to enlist these agents in complex misuse scenarios. Existing agent-misuse benchmarks largely test single-prompt instructions, leaving a gap in measuring how agents come to assist with harmful or illegal tasks over multiple turns. We introduce STING (Sequential Testing of Illicit N-step Goal execution), an automated red-teaming framework that constructs a step-by-step illicit plan grounded in a benign persona and iteratively probes a target agent with adaptive follow-ups, using judge agents to track phase completion. We further introduce an analysis framework that models multi-turn red-teaming as a time-to-first-jailbreak random variable, enabling analyses such as discovery curves, hazard-ratio attribution by attack language, and a new metric: Restricted Mean Jailbreak Discovery. Across AgentHarm scenarios, STING yields substantially higher illicit-task completion than single-turn prompting and chat-oriented multi-turn baselines adapted to tool-using agents. In multilingual evaluations across six non-English settings, we find that attack success and illicit-task completion do not consistently increase in lower-resource languages, diverging from common chatbot findings. Overall, STING provides a practical way to evaluate and stress-test agent misuse in realistic deployment settings, where interactions are inherently multi-turn and often multilingual.
Key Contributions
- STING: an automated multi-turn red-teaming framework that constructs step-by-step illicit plans with adaptive follow-ups to probe LLM agents, achieving higher illicit-task completion than single-turn and chat-oriented baselines
- A statistical analysis framework modeling multi-turn red-teaming as a time-to-first-jailbreak random variable, introducing discovery curves, hazard-ratio attribution by attack language, and the Restricted Mean Jailbreak Discovery (RMJD) metric
- Multilingual evaluation across six non-English languages showing that attack success does not consistently increase for lower-resource languages, diverging from common chatbot findings
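The time-to-first-jailbreak framing above can be made concrete with a short sketch. This is an illustrative reconstruction, not the paper's implementation: the function names are hypothetical, and the RMJD definition assumed here (area under the empirical discovery curve up to a turn horizon, analogous to restricted mean survival time) may differ from the paper's exact estimator. Scenarios never jailbroken within the turn budget are treated as censored.

```python
# Sketch of a time-to-first-jailbreak analysis (assumed definitions,
# not the paper's exact estimators).
# `first_jb` holds, per scenario, the turn of the first jailbreak,
# or None if no jailbreak occurred within the budget (censored).

def discovery_curve(first_jb, horizon):
    """Empirical fraction of scenarios jailbroken by turn t, for t = 1..horizon."""
    n = len(first_jb)
    return [sum(1 for t in first_jb if t is not None and t <= turn) / n
            for turn in range(1, horizon + 1)]

def rmjd(first_jb, horizon):
    """Restricted Mean Jailbreak Discovery (assumed form): area under the
    discovery curve up to `horizon`, i.e. mean discovered fraction per turn."""
    curve = discovery_curve(first_jb, horizon)
    return sum(curve) / horizon

# Toy data: 5 scenarios, two never jailbroken within a 4-turn budget.
data = [1, 2, None, 3, None]
print(discovery_curve(data, 4))  # [0.2, 0.4, 0.6, 0.6]
print(rmjd(data, 4))             # 0.45
```

Under this reading, a higher RMJD means jailbreaks are discovered both more often and earlier in the interaction, which is what distinguishes a multi-turn metric from a single end-of-budget attack success rate.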