attack 2025

The Trojan Example: Jailbreaking LLMs through Template Filling and Unsafety Reasoning

Mingrui Liu , Sixiao Zhang , Cheng Long , Kwok Yan Lam

2 citations · 57 references · arXiv

α

Published on arXiv

2510.21190

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

TrojFill achieves 100% Attack Success Rate on Gemini-flash-2.5 and DeepSeek-3.1 and 97% on GPT-4o using a transferable, interpretable black-box natural language jailbreak strategy.

TrojFill

Novel technique introduced


As Large Language Models (LLMs) become integral to computing infrastructure, safety alignment serves as the primary security control preventing the generation of harmful payloads. However, this defense remains brittle. Existing jailbreak attacks typically bifurcate into white-box methods, which are inapplicable to commercial APIs due to lack of gradient access, and black-box optimization techniques, which often yield unnatural (e.g., syntactically rigid) or non-transferable (e.g., lacking cross-model generalization) prompts. In this work, we introduce TrojFill, a black-box exploitation framework that bypasses safety filters by targeting a fundamental logic flaw in current alignment paradigms: the decoupling of unsafety reasoning from content generation. TrojFill structurally reframes malicious instructions as a template-filling task required for safety analysis. By embedding obfuscated payloads (e.g., via placeholder substitution) into a "Trojan" structure, the attack induces the model to generate prohibited content as a "demonstrative example" ostensibly required for a subsequent sentence-by-sentence safety critique. This approach effectively masks the malicious intent from standard intent classifiers. We evaluate TrojFill against representative commercial systems, including GPT-4o, Gemini-2.5, DeepSeek-3.1, and Qwen-Max. Our results demonstrate that TrojFill achieves near-universal bypass rates: reaching 100% Attack Success Rate (ASR) on Gemini-flash-2.5 and DeepSeek-3.1, and 97% on GPT-4o, significantly outperforming existing black-box baselines. Furthermore, unlike optimization-based adversarial prompts, TrojFill generates highly interpretable and transferable attack vectors, exposing a systematic vulnerability inaligned LLMs.


Key Contributions

  • TrojFill: a black-box jailbreak framework that exploits the structural decoupling between unsafety reasoning and content generation by embedding obfuscated payloads in a 'Trojan' template-filling prompt ostensibly for safety critique
  • Near-universal bypass rates: 100% ASR on Gemini-flash-2.5 and DeepSeek-3.1, 97% on GPT-4o, significantly outperforming existing black-box baselines
  • Attack vectors are interpretable and cross-model transferable, unlike optimization-based adversarial prompts that produce syntactically rigid or model-specific outputs

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_boxinference_timetargeted
Datasets
AdvBench
Applications
commercial llm apisllm safety alignment