The Trojan Example: Jailbreaking LLMs through Template Filling and Unsafety Reasoning
Mingrui Liu, Sixiao Zhang, Cheng Long, Kwok Yan Lam
Published on arXiv (2510.21190)
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
TrojFill achieves 100% Attack Success Rate on Gemini-flash-2.5 and DeepSeek-3.1 and 97% on GPT-4o using a transferable, interpretable black-box natural language jailbreak strategy.
TrojFill
Novel technique introduced
As Large Language Models (LLMs) become integral to computing infrastructure, safety alignment serves as the primary security control preventing the generation of harmful payloads. However, this defense remains brittle. Existing jailbreak attacks typically bifurcate into white-box methods, which are inapplicable to commercial APIs due to the lack of gradient access, and black-box optimization techniques, which often yield unnatural (e.g., syntactically rigid) or non-transferable (e.g., lacking cross-model generalization) prompts. In this work, we introduce TrojFill, a black-box exploitation framework that bypasses safety filters by targeting a fundamental logic flaw in current alignment paradigms: the decoupling of unsafety reasoning from content generation. TrojFill structurally reframes malicious instructions as a template-filling task required for safety analysis. By embedding obfuscated payloads (e.g., via placeholder substitution) into a "Trojan" structure, the attack induces the model to generate prohibited content as a "demonstrative example" ostensibly required for a subsequent sentence-by-sentence safety critique. This approach effectively masks the malicious intent from standard intent classifiers. We evaluate TrojFill against representative commercial systems, including GPT-4o, Gemini-2.5, DeepSeek-3.1, and Qwen-Max. Our results demonstrate that TrojFill achieves near-universal bypass rates: reaching 100% Attack Success Rate (ASR) on Gemini-flash-2.5 and DeepSeek-3.1, and 97% on GPT-4o, significantly outperforming existing black-box baselines. Furthermore, unlike optimization-based adversarial prompts, TrojFill generates highly interpretable and transferable attack vectors, exposing a systematic vulnerability in aligned LLMs.
Key Contributions
- TrojFill: a black-box jailbreak framework that exploits the structural decoupling between unsafety reasoning and content generation by embedding obfuscated payloads in a "Trojan" template-filling prompt ostensibly for safety critique
- Near-universal bypass rates: 100% ASR on Gemini-flash-2.5 and DeepSeek-3.1, 97% on GPT-4o, significantly outperforming existing black-box baselines
- Attack vectors are interpretable and cross-model transferable, unlike optimization-based adversarial prompts that produce syntactically rigid or model-specific outputs