Casting a SPELL: Sentence Pairing Exploration for LLM Limitation-breaking
Yifan Huang 1, Xiaojun Jia 1, Wenbo Guo 1, Yuqiang Sun 1, Yihao Huang 2, Chong Wang 1, Yang Liu 1
Published on arXiv
2512.21236
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Achieves attack success rates of 83.75%, 19.38%, and 68.12% on GPT-4.1, Claude-3.5, and Qwen2.5-Coder respectively, with generated code confirmed as malicious by state-of-the-art detection tools at rates exceeding 73%.
SPELL
Novel technique introduced
Large language models (LLMs) have revolutionized software development through AI-assisted coding tools, enabling developers with limited programming expertise to create sophisticated applications. However, this accessibility extends to malicious actors who may exploit these powerful tools to generate harmful software. Existing jailbreaking research primarily focuses on general attack scenarios against LLMs, with limited exploration of malicious code generation as a jailbreak target. To address this gap, we propose SPELL, a comprehensive testing framework specifically designed to evaluate weaknesses in the security alignment of LLMs against malicious code generation. Our framework employs a time-division selection strategy that systematically constructs jailbreaking prompts by intelligently combining sentences from a prior knowledge dataset, balancing exploration of novel attack patterns with exploitation of successful techniques. Extensive evaluation across three advanced code models (GPT-4.1, Claude-3.5, and Qwen2.5-Coder) demonstrates SPELL's effectiveness, achieving attack success rates of 83.75%, 19.38%, and 68.12% respectively across eight malicious code categories. The generated prompts successfully produce malicious code in real-world AI development tools such as Cursor, with outputs confirmed as malicious by state-of-the-art detection systems at rates exceeding 73%. These findings reveal significant security gaps in current LLM implementations and provide valuable insights for improving AI safety alignment in code generation applications.
Key Contributions
- SPELL framework that uses a time-division selection strategy to systematically construct jailbreaking prompts by combining sentences from a prior knowledge dataset, balancing exploration and exploitation of successful techniques.
- First comprehensive evaluation of LLM safety alignment specifically targeting malicious code generation across eight distinct malicious code categories (malware, ransomware, etc.).
- Empirical demonstration on GPT-4.1, Claude-3.5, and Qwen2.5-Coder, with outputs validated as genuinely malicious by external detection systems at rates exceeding 73%, and with attacks reproduced in real-world AI coding tools such as Cursor.
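The paper does not reproduce its selection algorithm here, but the described time-division strategy, favoring exploration of novel sentence combinations early and exploitation of empirically successful sentences later, can be sketched as an epsilon-greedy selector with a time-decaying exploration rate. This is an illustrative assumption, not SPELL's actual procedure; all names (`time_division_select`, `success_counts`, `trial_counts`) are hypothetical.

```python
import random

def time_division_select(sentences, success_counts, trial_counts,
                         step, total_steps, rng=random):
    """Pick one candidate sentence for the next prompt.

    Assumed epsilon-greedy with linearly decaying exploration:
    early steps mostly sample novel sentences; later steps mostly
    reuse sentences with the best empirical success rate.
    """
    # Exploration rate decays from 1.0 toward a 0.05 floor over time.
    epsilon = max(0.05, 1.0 - step / total_steps)
    if rng.random() < epsilon:
        return rng.choice(sentences)  # explore: try a novel sentence

    # Exploit: highest observed success rate so far.
    def rate(s):
        n = trial_counts.get(s, 0)
        return success_counts.get(s, 0) / n if n else 0.0
    return max(sentences, key=rate)
```

In this sketch the framework would record, for each candidate sentence, how often prompts containing it succeeded, then feed those counts back into the next selection round; the decaying epsilon is what makes the schedule "time-division", shifting budget from exploration to exploitation as trials accumulate.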