
Friend or Foe: How LLMs' Safety Mind Gets Fooled by Intent Shift Attack

Peng Ding 1, Jun Kuang 2, Wen Sun 2, Zongyu Wang 2, Xuezhi Cao 2, Xunliang Cai 2, Jiajun Chen 1, Shujian Huang 1



Published on arXiv: 2511.00556

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

ISA achieves over 70% improvement in attack success rate vs direct harmful prompts; fine-tuning on ISA-reformulated benign data elevates success to nearly 100%.

ISA (Intent Shift Attack)

Novel technique introduced


Large language models (LLMs) remain vulnerable to jailbreaking attacks despite their impressive capabilities. Investigating these weaknesses is crucial for building robust safety mechanisms. Existing attacks primarily distract LLMs by introducing additional context or adversarial tokens, leaving the core harmful intent unchanged. In this paper, we introduce ISA (Intent Shift Attack), which obscures the intent of an attack from the LLM. More specifically, we establish a taxonomy of intent transformations and leverage them to generate attacks that may be misperceived by LLMs as benign requests for information. Unlike prior methods that rely on complex tokens or lengthy context, our approach needs only minimal edits to the original request, and yields natural, human-readable, and seemingly harmless prompts. Extensive experiments on both open-source and commercial LLMs show that ISA achieves over 70% improvement in attack success rate compared to direct harmful prompts. More critically, fine-tuning models on only benign data reformulated with ISA templates elevates success rates to nearly 100%. For defense, we evaluate existing methods and demonstrate their inadequacy against ISA, while exploring both training-free and training-based mitigation strategies. Our findings reveal fundamental challenges in intent inference for LLM safety and underscore the need for more effective defenses. Our code and datasets are available at https://github.com/NJUNLP/ISA.


Key Contributions

  • Introduces ISA (Intent Shift Attack), a taxonomy-driven jailbreak that obfuscates harmful intent through minimal, human-readable text edits rather than adversarial tokens or lengthy context
  • Demonstrates 70%+ improvement in attack success rate over direct harmful prompts across open-source and commercial LLMs
  • Shows that fine-tuning models on benign ISA-reformulated data elevates attack success to nearly 100%, revealing fundamental challenges in LLM intent inference
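The headline metric throughout is attack success rate (ASR), the fraction of prompts for which the model produces a successful (here, harmful) completion. A minimal sketch of how such a comparison might be computed is below; the per-prompt success flags and the absolute-vs-relative reading of the "70%" figure are illustrative assumptions, not values from the paper:

```python
def attack_success_rate(results):
    """Fraction of prompts judged successful (results: list of bools)."""
    return sum(results) / len(results)

# Hypothetical illustration: direct harmful prompts vs ISA-reformulated ones.
direct = [False] * 9 + [True]   # assumed 10% ASR for direct prompts
isa = [True] * 8 + [False] * 2  # assumed 80% ASR after intent-shift rewrite

# Absolute improvement in percentage points (one plausible reading of "70%").
improvement = attack_success_rate(isa) - attack_success_rate(direct)
print(f"{improvement:.0%}")  # prints "70%"
```

Whether the paper's "over 70% improvement" is absolute or relative is not stated in this summary; the sketch shows the absolute-difference reading.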

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box, inference_time, targeted
Datasets
AdvBench
Applications
llm safety mechanisms, chatbots, instruction-following llms