defense 2026

Know Thy Enemy: Securing LLMs Against Prompt Injection via Diverse Data Synthesis and Instruction-Level Chain-of-Thought Learning

Zhiyuan Chang 1,2,3, Mingyang Li 1,2,3, Yuekai Huang 1,2,3, Ziyou Jiang 1,2,3, Xiaojun Jia 4, Qian Xiong 5, Junjie Wang 1,2,3, Zhaoyang Li 1,2,3, Qing Wang 1,2,3

0 citations · 32 references · arXiv

α

Published on arXiv

2601.04666

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

InstruCoT significantly outperforms baselines across all three evaluation dimensions (Behavior Deviation, Privacy Leakage, Harmful Output) on four LLMs while preserving downstream utility

InstruCoT

Novel technique introduced


Large language model (LLM)-integrated applications have become increasingly prevalent, yet face critical security vulnerabilities from prompt injection (PI) attacks. Defending against PI attacks faces two major issues: malicious instructions can be injected through diverse vectors, and injected instructions often lack clear semantic boundaries from the surrounding context, making them difficult to identify. To address these issues, we propose InstruCoT, a model enhancement method for PI defense that synthesizes diverse training data and employs instruction-level chain-of-thought fine-tuning, enabling LLMs to effectively identify and reject malicious instructions regardless of their source or position in the context. We evaluate InstruCoT across three critical dimensions: Behavior Deviation, Privacy Leakage, and Harmful Output. Experimental results across four LLMs demonstrate that InstruCoT significantly outperforms baselines in all dimensions while maintaining utility performance without degradation


Key Contributions

  • Diverse training data synthesis covering multiple prompt injection vectors and injection positions to improve generalization of the defense
  • Instruction-level chain-of-thought fine-tuning (InstruCoT) that teaches LLMs to explicitly reason about and reject malicious injected instructions
  • Evaluation framework across three dimensions — Behavior Deviation, Privacy Leakage, and Harmful Output — benchmarked on four LLMs with no utility degradation

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llmtransformer
Threat Tags
inference_timetraining_timeblack_box
Applications
llm-integrated applicationschatbotrag systems