defense 2026

Know Thy Enemy: Securing LLMs Against Prompt Injection via Diverse Data Synthesis and Instruction-Level Chain-of-Thought Learning

Zhiyuan Chang ^1,2,3, Mingyang Li ^1,2,3, Yuekai Huang ^1,2,3, Ziyou Jiang ^1,2,3, Xiaojun Jia ⁴, Qian Xiong ⁵, Junjie Wang ^1,2,3, Zhaoyang Li ^1,2,3, Qing Wang ^1,2,3

¹ State Key Laboratory of Complex System Modeling and Simulation Technology

² Institute of Software Chinese Academy of Sciences

³ University of Chinese Academy of Sciences

⁴ Nanyang Technological University

⁵ Beijing Forestry University

0 citations · 32 references · arXiv

Published on arXiv

2601.04666

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

InstruCoT significantly outperforms baselines across all three evaluation dimensions (Behavior Deviation, Privacy Leakage, Harmful Output) on four LLMs while preserving downstream utility

InstruCoT

Novel technique introduced

Large language model (LLM)-integrated applications have become increasingly prevalent, yet face critical security vulnerabilities from prompt injection (PI) attacks. Defending against PI attacks faces two major issues: malicious instructions can be injected through diverse vectors, and injected instructions often lack clear semantic boundaries from the surrounding context, making them difficult to identify. To address these issues, we propose InstruCoT, a model enhancement method for PI defense that synthesizes diverse training data and employs instruction-level chain-of-thought fine-tuning, enabling LLMs to effectively identify and reject malicious instructions regardless of their source or position in the context. We evaluate InstruCoT across three critical dimensions: Behavior Deviation, Privacy Leakage, and Harmful Output. Experimental results across four LLMs demonstrate that InstruCoT significantly outperforms baselines in all dimensions while maintaining utility performance without degradation

Key Contributions

Diverse training data synthesis covering multiple prompt injection vectors and injection positions to improve generalization of the defense
Instruction-level chain-of-thought fine-tuning (InstruCoT) that teaches LLMs to explicitly reason about and reject malicious injected instructions
Evaluation framework across three dimensions — Behavior Deviation, Privacy Leakage, and Harmful Output — benchmarked on four LLMs with no utility degradation

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llmtransformer

Threat Tags

inference_timetraining_timeblack_box

Applications

llm-integrated applicationschatbotrag systems

Read PDF arXiv DOI

Know Thy Enemy: Securing LLMs Against Prompt Injection via Diverse Data Synthesis and Instruction-Level Chain-of-Thought Learning

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

Adversarial Attack-Defense Co-Evolution for LLM Safety Alignment via Tree-Group Dual-Aware Search and Optimization

UpSafe$^\circ$C: Upcycling for Controllable Safety in Large Language Models

Early Approaches to Adversarial Fine-Tuning for Prompt Injection Defense: A 2022 Study of GPT-3 and Contemporary Models

Forewarned is Forearmed: Pre-Synthesizing Jailbreak-like Instructions to Enhance LLM Safety Guardrail to Potential Attacks

DRIP: Defending Prompt Injection via Token-wise Representation Editing and Residual Instruction Fusion

LatentGuard: Controllable Latent Steering for Robust Refusal of Attacks and Reliable Response Generation

A Biosecurity Agent for Lifecycle LLM Biosecurity Alignment

From Shallow to Deep: Pinning Semantic Intent via Causal GRPO