Xinlei He

attack · arXiv · Nov 20, 2025

"To Survive, I Must Defect": Jailbreaking LLMs via the Game-Theory Scenarios

Zhen Sun, Zongmin Zhang, Deqi Liang et al. · The Hong Kong University of Science and Technology · East China Normal University +5 more

Game-theoretic black-box jailbreak using Prisoner's Dilemma scenarios to flip LLM safety preferences, achieving 95%+ ASR on GPT-4o and DeepSeek-R1

Prompt Injection nlp

2 citations

As LLMs become more common, non-expert users can pose risks, prompting extensive research into jailbreak attacks. However, most existing black-box jailbreak attacks rely on hand-crafted heuristics or narrow search spaces, which limit scalability. To address this limitation, we propose Game-Theory Attack (GTA), a scalable black-box jailbreak framework. Concretely, we formalize the attacker's interaction with safety-aligned LLMs as a finite-horizon, early-stoppable sequential stochastic game, and reparameterize the LLM's randomized outputs via quantal response. Building on this, we introduce a behavioral conjecture, the "template-over-safety flip": by reshaping the LLM's effective objective through game-theoretic scenarios, the original safety preference may give way to maximizing scenario payoffs within the template, which weakens safety constraints in specific contexts. We validate this mechanism with classical games such as the disclosure variant of the Prisoner's Dilemma, and we further introduce an Attacker Agent that adaptively escalates pressure to increase the attack success rate (ASR). Experiments across multiple protocols and datasets show that GTA achieves over 95% ASR on LLMs such as DeepSeek-R1 while maintaining efficiency. Ablations over components, decoding, multilingual settings, and the Agent's core model confirm effectiveness and generalization, and scenario scaling studies further establish scalability. GTA also attains high ASR on other game-theoretic scenarios, and one-shot LLM-generated variants that keep the model mechanism fixed while varying the background achieve comparable ASR. Paired with a Harmful-Words Detection Agent that performs word-level insertions, GTA maintains high ASR while lowering detection by prompt-guard models. Beyond benchmarks, GTA jailbreaks real-world LLM applications, and we report longitudinal safety monitoring of popular HuggingFace LLMs.
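
The abstract mentions reparameterizing randomized outputs via quantal response but does not give the formula. As a rough illustration only, the sketch below uses the standard logit quantal response from behavioral game theory, where an action is chosen with probability proportional to exp(rationality × payoff). The function name, the `rationality` parameter, the action labels, and all payoff values are hypothetical and are not taken from the paper.

```python
import math


def quantal_response(payoffs, rationality=1.0):
    """Logit quantal response over a dict of {action: payoff}.

    Assumption: this is the textbook logit form, not necessarily the
    parameterization used by the paper.
    """
    # Subtract the maximum payoff before exponentiating for numerical stability.
    top = max(payoffs.values())
    weights = {a: math.exp(rationality * (u - top)) for a, u in payoffs.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


# Hypothetical payoffs: outside any scenario, refusing has the higher payoff.
baseline = {"refuse": 1.0, "comply": 0.0}
# Hypothetical payoffs: inside a game scenario, the payoff ordering is reversed.
in_scenario = {"refuse": 0.0, "comply": 2.0}

p_base = quantal_response(baseline, rationality=2.0)
p_scen = quantal_response(in_scenario, rationality=2.0)
# With rationality = 0 the payoffs are ignored and the choice is uniform.
p_uniform = quantal_response(baseline, rationality=0.0)
```

Under these made-up numbers, the most probable action changes from "refuse" to "comply" once the payoff ordering is reversed, which is one way to read the "template-over-safety flip" conjecture stated in the abstract.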

llm The Hong Kong University of Science and Technology · East China Normal University · Flexera +4 more


attack · arXiv · Dec 22, 2025

6DAttack: Backdoor Attacks in the 6DoF Pose Estimation

Jihui Guo, Zongmin Zhang, Zhen Sun et al. · The University of Hong Kong · The Hong Kong University of Science and Technology +2 more

Backdoor attack on 6DoF pose estimation using 3D object triggers to induce controlled erroneous rotations and translations with 100% ASR

Model Poisoning vision

1 citation

attack · arXiv · Nov 18, 2025

GRPO Privacy Is at Risk: A Membership Inference Attack Against Reinforcement Learning With Verifiable Rewards

Yule Liu, Heyi Zhang, Jinyi Zheng et al. · The Hong Kong University of Science and Technology · Shanghai Jiao Tong University +2 more

First membership inference attack against RLVR-trained LLMs using behavioral divergence signals instead of memorization

Membership Inference Attack nlp multimodal reinforcement-learning

1 citation
