attack 2025

Semantic Representation Attack against Aligned Large Language Models

Jiawei Lian ¹,², Jianhong Pan ¹, Lefan Wang ², Yi Wang ¹, Shaohui Mei ², Lap-Pui Chau ¹

¹ The Hong Kong Polytechnic University

² Northwestern Polytechnical University

1 citation · 77 references · arXiv

Published on arXiv

2509.19360

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves 89.41% average attack success rate across 18 LLMs (100% on 11 models) while maintaining prompt naturalness, outperforming existing jailbreak methods

Semantic Representation Attack (SRA) / Semantic Representation Heuristic Search (SRHS)

Novel technique introduced

Large Language Models (LLMs) increasingly employ alignment techniques to prevent harmful outputs. Despite these safeguards, attackers can circumvent them by crafting prompts that induce LLMs to generate harmful content. Current methods typically target exact affirmative responses, such as "Sure, here is...", suffering from limited convergence, unnatural prompts, and high computational costs. We introduce Semantic Representation Attack, a novel paradigm that fundamentally reconceptualizes adversarial objectives against aligned LLMs. Rather than targeting exact textual patterns, our approach exploits the semantic representation space comprising diverse responses with equivalent harmful meanings. This innovation resolves the inherent trade-off between attack efficacy and prompt naturalness that plagues existing methods. The Semantic Representation Heuristic Search algorithm is proposed to efficiently generate semantically coherent and concise adversarial prompts by maintaining interpretability during incremental expansion. We establish rigorous theoretical guarantees for semantic convergence and demonstrate that our method achieves unprecedented attack success rates (89.41% averaged across 18 LLMs, including 100% on 11 models) while maintaining stealthiness and efficiency. Comprehensive experimental results confirm the overall superiority of our Semantic Representation Attack. The code will be publicly available.

Key Contributions

Semantic Representation Attack (SRA) paradigm that targets the semantic representation space of harmful responses rather than exact affirmative text patterns, resolving the efficacy-naturalness trade-off
Semantic Representation Heuristic Search (SRHS) algorithm with theoretical guarantees for coherence preservation and semantic convergence during incremental adversarial prompt expansion
89.41% average attack success rate across 18 LLMs with 100% ASR on 11 models, outperforming prior jailbreak methods while maintaining prompt naturalness and stealthiness
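
The headline numbers above can be unpacked with a little arithmetic. The sketch below assumes the reported 89.41% is an unweighted mean of per-model attack success rates (the page does not state how the average is formed), and the function name `implied_mean_of_rest` is purely illustrative. Under that assumption, it computes the mean ASR that the remaining 7 models must have if 11 of the 18 sit at 100%.

```python
def implied_mean_of_rest(overall_mean, n_models, n_perfect, perfect_value=100.0):
    """Mean score of the non-perfect models implied by an unweighted overall mean.

    Assumption: overall_mean is a simple (macro) average over n_models,
    of which n_perfect score exactly perfect_value.
    """
    n_rest = n_models - n_perfect
    if n_rest <= 0:
        raise ValueError("need at least one model below the perfect score")
    total = overall_mean * n_models              # sum of all per-model scores
    return (total - perfect_value * n_perfect) / n_rest


# Figures taken from the summary: 89.41% over 18 LLMs, 100% on 11 of them.
rest_mean = implied_mean_of_rest(89.41, 18, 11)
print(f"Implied mean ASR on the other 7 models: {rest_mean:.2f}%")
```

If the average were instead weighted by the number of prompts per model, the implied figure would differ, so this is only a rough reading of the summary statistic.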

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm

Threat Tags

black_box, inference_time, targeted

Datasets

AdvBench

Applications

safety-aligned llms, chatbots, ai assistants

Similar Papers

Casting a SPELL: Sentence Pairing Exploration for LLM Limitation-breaking

The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search

"To Survive, I Must Defect": Jailbreaking LLMs via the Game-Theory Scenarios

PISmith: Reinforcement Learning-based Red Teaming for Prompt Injection Defenses

When AIOps Become "AI Oops": Subverting LLM-driven IT Operations via Telemetry Manipulation

Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs

The Echo Chamber Multi-Turn LLM Jailbreak

From Rookie to Expert: Manipulating LLMs for Automated Vulnerability Exploitation in Enterprise Software