"To Survive, I Must Defect": Jailbreaking LLMs via the Game-Theory Scenarios
Zhen Sun 1, Zongmin Zhang 1, Deqi Liang 1, Han Sun 2, Yule Liu 1, Yun Shen 3, Xiangshan Gao 4, Yilong Yang 5, Shuai Liu 6, Yutao Yue 1,7, Xinlei He 1
Published on arXiv
2511.16278
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
GTA achieves over 95% attack success rate on GPT-4o and DeepSeek-R1 using game-theoretic scenario templates, while also evading prompt-guard models when paired with a Harmful-Words Detection Agent.
GTA (Game-Theory Attack)
Novel technique introduced
As LLMs become more widely deployed, even non-expert users can pose risks, prompting extensive research into jailbreak attacks. However, most existing black-box jailbreak attacks rely on hand-crafted heuristics or narrow search spaces, which limits scalability. To address this, we propose Game-Theory Attack (GTA), a scalable black-box jailbreak framework. Concretely, we formalize the attacker's interaction with safety-aligned LLMs as a finite-horizon, early-stoppable sequential stochastic game, and reparameterize the LLM's randomized outputs via quantal response. Building on this, we introduce a behavioral conjecture, the "template-over-safety flip": by reshaping the LLM's effective objective through game-theoretic scenarios, the model's original safety preference may shift toward maximizing scenario payoffs within the template, which weakens safety constraints in specific contexts. We validate this mechanism with classical games such as a disclosure variant of the Prisoner's Dilemma, and we further introduce an Attacker Agent that adaptively escalates pressure to increase the attack success rate (ASR). Experiments across multiple protocols and datasets show that GTA achieves over 95% ASR on LLMs such as DeepSeek-R1 while maintaining efficiency. Ablations over components, decoding strategies, multilingual settings, and the Agent's core model confirm its effectiveness and generalization, and scenario scaling studies further establish scalability. GTA also attains high ASR on other game-theoretic scenarios, and one-shot LLM-generated variants that keep the game mechanism fixed while varying the background achieve comparable ASR. Paired with a Harmful-Words Detection Agent that performs word-level insertions, GTA maintains high ASR while lowering detection rates under prompt-guard models. Beyond benchmarks, GTA jailbreaks real-world LLM applications, and we report longitudinal safety monitoring of popular HuggingFace LLMs.
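The quantal-response reparameterization mentioned in the abstract can be illustrated with a small sketch: choice probabilities are proportional to the exponential of scaled payoffs (a softmax over utilities). The payoff values, action labels, and rationality parameter below are purely illustrative assumptions, not figures from the paper.

```python
import math

def quantal_response(payoffs, lam=1.0):
    """Quantal response: P(action) proportional to exp(lam * payoff).

    lam -> 0 yields uniform (fully noisy) play; lam -> infinity
    approaches a deterministic best response.
    """
    exps = [math.exp(lam * u) for u in payoffs]
    total = sum(exps)
    return [e / total for e in exps]

# Two actions for the model: index 0 = "refuse", index 1 = "comply".
# Under a plain prompt, the safety payoff dominates; under a game
# template, the effective payoffs may flip (the conjectured
# "template-over-safety flip"). Numbers are hypothetical.
plain = quantal_response([2.0, 0.5], lam=2.0)      # refusal favored
templated = quantal_response([0.5, 2.0], lam=2.0)  # compliance favored
print(plain, templated)
```

The sketch only shows how reshaped payoffs move probability mass between actions under a fixed quantal-response rule; the paper's game is sequential and multi-round rather than this single-shot toy.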
Key Contributions
- Formalizes black-box jailbreaking as a finite-horizon sequential stochastic game and introduces the 'template-over-safety flip' behavioral conjecture explaining how game-theoretic scenarios override LLM safety preferences.
- Designs a Mechanism-Induced Graded Prisoner's Dilemma jailbreak template and an adaptive Attacker Agent that escalates pressure based on interaction feedback to maximize ASR.
- Demonstrates >95% ASR on GPT-4o and DeepSeek-R1 with fewer queries than prior multi-round attacks, and reports longitudinal safety monitoring of popular HuggingFace LLMs with average ASR above 86%.
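The disclosure-style Prisoner's Dilemma underlying the template can be sketched with a toy payoff table: when disclosure strictly dominates, a payoff-maximizing player defects regardless of the other side's move, which is the kind of pressure the scenario places on the model. The payoff numbers and action names below are hypothetical illustrations, not values from the paper.

```python
# Hypothetical payoff table for a disclosure variant of the Prisoner's
# Dilemma: each player chooses "withhold" or "disclose".
# Tuple = (row player's payoff, column player's payoff).
PAYOFFS = {
    ("withhold", "withhold"): (3, 3),
    ("withhold", "disclose"): (0, 5),
    ("disclose", "withhold"): (5, 0),
    ("disclose", "disclose"): (1, 1),
}

def best_response(opponent_action):
    """Row player's payoff-maximizing reply to a fixed opponent action."""
    return max(("withhold", "disclose"),
               key=lambda a: PAYOFFS[(a, opponent_action)][0])

# "disclose" is a strictly dominant strategy in this toy table:
print(best_response("withhold"), best_response("disclose"))
```

In the attack framing, "to survive, I must defect" corresponds to this dominance: once the template's payoffs are the effective objective, disclosure is the rational move at every information set.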