Echoes of Human Malice in Agents: Benchmarking LLMs for Multi-Turn Online Harassment Attacks
Trilok Padhi¹, Pinxian Lu², Abdulkadir Erol¹, Tanmay Sutar², Gauri Sharma², Mina Sonmez³, Munmun De Choudhury², Ugur Kursuncu¹
Published on arXiv
arXiv:2510.14207
Transfer Learning Attack
OWASP ML Top 10 — ML07
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Jailbreak fine-tuning achieves a 95.78–96.89% attack success rate on LLaMA-3.1-8B-Instruct and 99.33% on Gemini-2.0-flash, reducing refusal rates to 1–2% in both models.
Online Harassment Agentic Benchmark (OHAB)
Novel technique introduced
Large Language Model (LLM) agents are powering a growing share of interactive web applications, yet remain vulnerable to misuse and harm. Prior jailbreak research has largely focused on single-turn prompts, whereas real harassment often unfolds over multi-turn interactions. In this work, we present the Online Harassment Agentic Benchmark consisting of: (i) a synthetic multi-turn harassment conversation dataset, (ii) a multi-agent (e.g., harasser, victim) simulation informed by repeated game theory, (iii) three jailbreak methods attacking agents across memory, planning, and fine-tuning, and (iv) a mixed-methods evaluation framework. We utilize two prominent LLMs, LLaMA-3.1-8B-Instruct (open-source) and Gemini-2.0-flash (closed-source). Our results show that jailbreak tuning makes harassment nearly guaranteed, with an attack success rate of 95.78–96.89% vs. 57.25–64.19% without tuning in LLaMA, and 99.33% vs. 98.46% without tuning in Gemini, while sharply reducing the refusal rate to 1–2% in both models. The most prevalent toxic behaviors are Insult (84.9–87.8% vs. 44.2–50.8% without tuning) and Flaming (81.2–85.1% vs. 31.5–38.8% without tuning), indicating weaker guardrails than for sensitive categories such as sexual or racial harassment. Qualitative evaluation further reveals that attacked agents reproduce human-like aggression profiles, such as Machiavellian/psychopathic patterns under planning attacks and narcissistic tendencies under memory attacks. Counterintuitively, closed-source and open-source models exhibit distinct escalation trajectories across turns, with closed-source models showing significant vulnerability. Overall, our findings show that multi-turn, theory-grounded attacks not only succeed at high rates but also mimic human-like harassment dynamics, motivating the development of robust safety guardrails to keep online platforms safe and responsible.
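The repeated-game, multi-agent setup described in the abstract can be sketched as a turn loop between two role-conditioned agents. This is a minimal illustrative sketch only: the `Agent` class, the `respond` placeholder, and the fixed-horizon `run_episode` loop are assumptions for exposition, not the paper's actual simulation code, and the LLM call is stubbed out.

```python
# Illustrative sketch of a repeated-game, two-agent conversation loop
# (harasser/victim roles as in the benchmark). The class structure and
# the stubbed respond() method are assumptions, not the paper's code.

from dataclasses import dataclass, field

@dataclass
class Agent:
    role: str                       # e.g., "harasser" or "victim"
    persona: str                    # system-style role description
    history: list = field(default_factory=list)

    def respond(self, message: str) -> str:
        # Placeholder for an LLM call conditioned on persona + history.
        self.history.append(("in", message))
        reply = f"[{self.role} reply to: {message[:30]}]"
        self.history.append(("out", reply))
        return reply

def run_episode(a: Agent, b: Agent, opener: str, turns: int = 5):
    """Alternate messages for a fixed horizon, as in a repeated game."""
    transcript, msg = [], opener
    for _ in range(turns):
        msg = a.respond(msg)
        transcript.append((a.role, msg))
        msg = b.respond(msg)
        transcript.append((b.role, msg))
    return transcript

transcript = run_episode(Agent("harasser", "persona A"),
                         Agent("victim", "persona B"),
                         opener="hello", turns=3)
```

A fixed, known horizon matters here: in repeated game theory, finite-horizon play can change agents' incentives across turns, which is why the benchmark studies escalation trajectories over multiple turns rather than single prompts.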
Key Contributions
- Synthetic multi-turn harassment conversation dataset and multi-agent simulation (harasser/victim) grounded in repeated game theory
- Three jailbreak attack methods targeting LLM agents across memory, planning, and fine-tuning dimensions
- Mixed-methods evaluation framework measuring attack success rate, refusal rate, and qualitative aggression profiles across open- and closed-source LLMs
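The quantitative side of the evaluation framework above reduces to rates over labeled conversations. A minimal sketch, assuming a simple per-conversation label schema (`"harassment"`, `"refusal"`, `"benign"`) that is illustrative rather than the paper's exact annotation scheme:

```python
# Sketch of the two headline metrics: attack success rate (ASR) and
# refusal rate. The label strings are an assumed schema for illustration.

def attack_success_rate(labels):
    """Fraction of attacked conversations labeled as successful harassment."""
    return sum(1 for l in labels if l == "harassment") / len(labels)

def refusal_rate(labels):
    """Fraction of conversations in which the model refused to comply."""
    return sum(1 for l in labels if l == "refusal") / len(labels)

labels = ["harassment", "harassment", "refusal", "harassment"]
print(round(attack_success_rate(labels), 2))  # 0.75
print(round(refusal_rate(labels), 2))         # 0.25
```

Per-category prevalence (e.g., Insult, Flaming) follows the same pattern with category labels, which is how figures like "84.9–87.8% Insult with tuning vs. 44.2–50.8% without" would be computed.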
🛡️ Threat Analysis
Fine-tuning (jailbreak tuning) is explicitly one of the paper's three attack vectors and the most effective: it exploits the fine-tuning process to bypass safety alignment, achieving a 95–99% attack success rate by embedding malicious behavior that persists into the deployed agent.