Overlooked Safety Vulnerability in LLMs: Malicious Intelligent Optimization Algorithm Request and its Jailbreak
Haoran Gu, Handing Wang, Yi Mei, Mengjie Zhang, Yaochu Jin
Published on arXiv
2601.00213
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Malicious algorithm design requests already succeed against 13 mainstream LLMs with an 83.59% average attack success rate on unmodified prompts; under MOBjailbreak the bypass becomes near-complete, and existing plug-and-play defenses prove only marginally effective.
MOBjailbreak
Novel technique introduced
The widespread deployment of large language models (LLMs) has raised growing concerns about their misuse risks and associated safety issues. While prior studies have examined the safety of LLMs in general usage, code generation, and agent-based applications, their vulnerabilities in automated algorithm design remain underexplored. To fill this gap, this study investigates that overlooked safety vulnerability, focusing on intelligent optimization algorithm design given its prevalent use in complex decision-making scenarios. We introduce MalOptBench, a benchmark of 60 malicious optimization algorithm requests, and propose MOBjailbreak, a jailbreak method tailored to this scenario. Through extensive evaluation of 13 mainstream LLMs, including the latest GPT-5 and DeepSeek-V3.1, we reveal that most models remain highly susceptible to such attacks, with an average attack success rate of 83.59% and an average harmfulness score of 4.28 out of 5 on the original harmful prompts, and near-complete alignment failure (almost all attacks succeed) under MOBjailbreak. Furthermore, we assess state-of-the-art plug-and-play defenses applicable to closed-source models, and find that they are only marginally effective against MOBjailbreak and prone to exaggerated safety behaviors. These findings highlight the urgent need for stronger alignment techniques to safeguard LLMs against misuse in algorithm design.
Key Contributions
- MalOptBench: a benchmark of 60 malicious intelligent optimization algorithm requests spanning four task categories, generated via a two-stage LLM-based adversarial pipeline
- MOBjailbreak: a surrogate-model-based jailbreak method that rewrites harmful algorithm design prompts into benign-sounding expressions to bypass target LLM safety filters
- Evaluation of 13 mainstream LLMs (including GPT-5 and DeepSeek-V3.1) showing an 83.59% average attack success rate on original prompts and near-complete jailbreak success under MOBjailbreak, with state-of-the-art plug-and-play defenses found only marginally effective
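The two evaluation metrics reported above, attack success rate (ASR) and a 1–5 harmfulness score, can be sketched as follows. This is a minimal illustrative computation, not the paper's actual evaluation code; the class and function names are assumptions, and real evaluations typically use an LLM or human judge to assign the per-request labels.

```python
# Hypothetical sketch of the benchmark metrics: ASR is the fraction
# of malicious requests the target LLM complied with; harmfulness is
# the mean of judge-assigned scores on a 1-5 scale. Names are
# illustrative, not taken from the paper's implementation.
from dataclasses import dataclass


@dataclass
class EvalResult:
    success: bool    # did the target LLM comply with the malicious request?
    harmfulness: int # judge-assigned harmfulness score, 1 (benign) to 5 (severe)


def attack_success_rate(results: list[EvalResult]) -> float:
    """Fraction of benchmark requests where the attack succeeded."""
    return sum(r.success for r in results) / len(results)


def mean_harmfulness(results: list[EvalResult]) -> float:
    """Average harmfulness score across all benchmark requests."""
    return sum(r.harmfulness for r in results) / len(results)


# Toy run over five mock results (4 of 5 attacks succeed):
toy = [
    EvalResult(True, 5), EvalResult(True, 4), EvalResult(False, 1),
    EvalResult(True, 5), EvalResult(True, 4),
]
print(attack_success_rate(toy))  # 0.8
print(mean_harmfulness(toy))     # 3.8
```

In the paper's setting, these aggregates are computed per model over the 60 MalOptBench requests and then averaged across the 13 evaluated LLMs, yielding the reported 83.59% ASR and 4.28/5 harmfulness on original prompts.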