defense 2025

DRIP: Defending Prompt Injection via Token-wise Representation Editing and Residual Instruction Fusion

Ruofan Liu 1, Yun Lin 2, Zhiyong Huang 1, Jin Song Dong 1

1 citation · arXiv (Cornell University)


Published on arXiv: 2511.00447

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

DRIP reduces adaptive prompt injection attack success rate by over 66% and improves role separation by 12–49% while maintaining utility on par with the undefended model on LLaMA-8B and Mistral-7B.

DRIP

Novel technique introduced


Large language models (LLMs) are increasingly integrated into IT infrastructures, where they process user data according to predefined instructions. However, conventional LLMs remain vulnerable to prompt injection, where malicious users inject directive tokens into the data to subvert model behavior. Existing defenses train LLMs to semantically separate data and instruction tokens, but still struggle to (1) balance utility and security and (2) prevent instruction-like semantics in the data from overriding the intended instructions. We propose DRIP, which (1) precisely removes instruction semantics from tokens in the data section while preserving their data semantics, and (2) robustly preserves the effect of the intended instruction even under strong adversarial content. To "de-instructionalize" data tokens, DRIP introduces a data curation and training paradigm with a lightweight representation-editing module that edits embeddings of instruction-like tokens in the data section, enhancing security without harming utility. To ensure non-overwritability of instructions, DRIP adds a minimal residual module that reduces the ability of adversarial data to overwrite the original instruction. We evaluate DRIP on LLaMA-8B and Mistral-7B against StruQ, SecAlign, ISE, and PFT on three prompt-injection benchmarks (SEP, AlpacaFarm, and InjecAgent). DRIP improves role-separation score by 12–49%, reduces attack success rate by over 66% under adaptive attacks, and matches the utility of the undefended model, establishing a new state of the art for prompt-injection robustness.
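The paper's representation-editing module is learned and is not reproduced here. As a rough illustrative sketch only, under the assumption that "instruction semantics" can be approximated by a single learned direction in embedding space (the `instr_direction` vector below is hypothetical, not from the paper), a token-wise edit that suppresses that direction for data-section tokens while leaving instruction tokens untouched could look like:

```python
import numpy as np

def edit_data_embeddings(embeddings, data_mask, instr_direction, strength=1.0):
    """Sketch of a token-wise representation edit.

    embeddings:      (seq_len, d) token embeddings
    data_mask:       (seq_len,) bool, True for data-section tokens
    instr_direction: (d,) hypothetical learned "instruction semantics" direction
    strength:        how much of that direction to remove (1.0 = full projection)
    """
    v = instr_direction / np.linalg.norm(instr_direction)  # unit direction
    edited = embeddings.copy()
    # Project out the instruction component for data tokens only,
    # preserving the remaining (data) semantics of each token.
    proj = edited[data_mask] @ v                 # (n_data,) components along v
    edited[data_mask] -= strength * np.outer(proj, v)
    return edited
```

In DRIP the edit is produced by a trained lightweight module rather than a fixed projection, but the intent is the same: data tokens keep their content semantics while their directive character is removed.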


Key Contributions

  • Token-wise representation-editing module that removes instruction semantics from data-section tokens while preserving data semantics (de-instructionalization), trained with a novel data curation and loss design paradigm
  • Residual instruction fusion module that anchors the intended instruction's influence in LLM hidden states, reducing adversarial data's ability to overwrite it
  • New SOTA prompt injection defense on LLaMA-8B and Mistral-7B: 12–49% improvement in role separation score and >66% reduction in attack success rate under adaptive attacks, with utility matching an undefended model
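The residual fusion idea of the second contribution can be caricatured in a few lines. This is a minimal sketch under assumed details (mean-pooled instruction hidden states as the anchor, a hypothetical scalar `alpha`); the paper's module is learned and minimal, but the mechanism it describes is a residual path that keeps the instruction's influence present in later hidden states:

```python
import numpy as np

def residual_instruction_fusion(hidden, instr_mask, alpha=0.1):
    """Sketch of residual instruction fusion.

    hidden:     (seq_len, d) hidden states at some layer
    instr_mask: (seq_len,) bool, True for instruction-section tokens
    alpha:      hypothetical fusion weight

    Adds a scaled summary of the instruction's hidden states to every
    position, so adversarial data tokens cannot fully overwrite the
    intended instruction's contribution downstream.
    """
    anchor = hidden[instr_mask].mean(axis=0)     # (d,) instruction summary
    return hidden + alpha * anchor[None, :]      # broadcast residual to all tokens
```

Because the anchor is re-injected additively at every position, even data spans crafted to dominate attention still carry a residual trace of the original instruction.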

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, inference_time, black_box
Datasets
SEP, AlpacaFarm, InjecAgent, AlpacaEval, IFEval, MT-Bench
Applications
LLM-integrated IT systems, agentic LLM pipelines, instruction-following LLMs