defense 2026

The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?

0 citations

Published on arXiv

2604.06436

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Any continuous wrapper defense D: X→X that preserves utility (leaves safe prompts unchanged) must fix at least one boundary point at the safety threshold τ; under a transversality condition, a positive-measure region of inputs remains strictly unsafe.
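
The boundary-fixation part of this finding can be motivated by a short topological argument. The sketch below is a reconstruction under stated assumptions, not the paper's exact statement: $f: X\to\mathbb{R}$ is a hypothetical continuous safety score, $S$ is the resulting safe set, and $X$ is additionally assumed Hausdorff so that fixed-point sets of continuous maps are closed.

```latex
% Illustrative assumptions (not taken from the paper's text):
%   X connected and Hausdorff, f : X -> R continuous,
%   S = { x : f(x) < tau } nonempty with S \neq X.
% Requires amsmath.
\begin{align*}
&\text{Utility preservation: } D(x) = x \ \text{for all } x \in S
  \;\Longrightarrow\; S \subseteq \mathrm{Fix}(D) = \{x : D(x) = x\}. \\
&\mathrm{Fix}(D) \text{ is closed ($D$ continuous, $X$ Hausdorff)}
  \;\Longrightarrow\; \overline{S} \subseteq \mathrm{Fix}(D). \\
&S \text{ is open, nonempty, and proper in the connected space } X
  \;\Longrightarrow\; \overline{S} \setminus S \neq \emptyset. \\
&z \in \overline{S} \setminus S
  \;\Longrightarrow\; f(z) = \tau \ \text{and}\ D(z) = z,
  \ \text{so } f(D(z)) = \tau \not< \tau.
\end{align*}
```

In words: the defense cannot move the threshold-level point $z$, so at least one input is left exactly at the threshold rather than strictly below it.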

Defense Trilemma

Novel technique introduced

We prove that no continuous, utility-preserving wrapper defense (a function $D: X\to X$ that preprocesses inputs before the model sees them) can make all outputs strictly safe for a language model with connected prompt space, and we characterize exactly where every such defense must fail. We establish three results under successively stronger hypotheses: boundary fixation (the defense must leave some threshold-level inputs unchanged); an $\varepsilon$-robust constraint (under Lipschitz regularity, a positive-measure band around fixed boundary points remains near-threshold); and a persistent unsafe region (under a transversality condition, a positive-measure subset of inputs remains strictly unsafe). These constitute a defense trilemma: continuity, utility preservation, and completeness cannot coexist. We prove parallel discrete results requiring no topology, and extend to multi-turn interactions, stochastic defenses, and capacity-parity settings. The results do not preclude training-time alignment, architectural changes, or defenses that sacrifice utility. The full theory is mechanically verified in Lean 4 and validated empirically on three LLMs.
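
The trade-off can be seen in a one-dimensional toy model. The sketch below is only an illustration under assumptions that are not from the paper: the prompt space is the interval [0, 1], the safety score is the hypothetical `f(x) = x`, the threshold is `TAU = 0.5`, and the two wrappers `clamp_defense` and `jump_defense` are made-up examples. Continuity is only approximated by the largest jump between neighboring grid points.

```python
# Toy 1-D illustration of the trilemma (all names and values are hypothetical).
TAU = 0.5  # assumed safety threshold


def f(x: float) -> float:
    """Toy safety score: larger means less safe."""
    return x


def clamp_defense(x: float) -> float:
    """Continuous and utility-preserving, but leaves x = TAU at the threshold."""
    return min(x, TAU)


def jump_defense(x: float) -> float:
    """Utility-preserving and complete, but discontinuous at x = TAU."""
    return x if x < TAU else 0.25


def preserves_utility(defense, grid) -> bool:
    """Safe inputs (f(x) < TAU) must pass through unchanged."""
    return all(defense(x) == x for x in grid if f(x) < TAU)


def is_complete(defense, grid) -> bool:
    """Every defended input must be strictly below the threshold."""
    return all(f(defense(x)) < TAU for x in grid)


def max_jump(defense, grid) -> float:
    """Largest change between neighboring grid points (a crude continuity proxy)."""
    return max(abs(defense(b) - defense(a)) for a, b in zip(grid, grid[1:]))


grid = [i / 1000 for i in range(1001)]

for name, defense in [("clamp", clamp_defense), ("jump", jump_defense)]:
    print(
        name,
        "utility:", preserves_utility(defense, grid),
        "complete:", is_complete(defense, grid),
        "max jump:", round(max_jump(defense, grid), 3),
    )
```

On this grid the clamp wrapper keeps utility and stays continuous but fails completeness at `x = TAU`, while the jump wrapper achieves completeness only by introducing a discontinuity at the threshold. Each wrapper gives up one of the three properties, which is the pattern the trilemma describes.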

Key Contributions

Proves defense trilemma: continuous, utility-preserving wrapper defenses cannot achieve complete safety on connected prompt spaces
Establishes three impossibility results under successively stronger assumptions: boundary fixation, ε-robust constraint, and persistent unsafe region
Provides mechanically verified proofs in Lean 4 with ~350 theorems and empirical validation on three LLMs
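
The second result, the ε-robust constraint, can be motivated by a one-line Lipschitz estimate. This is a sketch under assumed notation rather than the paper's own formulation: $d$ is a metric on $X$, $K$ and $L$ are hypothetical Lipschitz constants of $D$ and of a safety score $f$, and $z$ is a fixed boundary point with $D(z) = z$ and $f(z) = \tau$.

```latex
% Illustrative assumptions: D is K-Lipschitz, f is L-Lipschitz (K, L > 0),
% and z is a fixed boundary point, i.e. D(z) = z and f(z) = tau.
% Requires amsmath.
\begin{align*}
|f(D(x)) - \tau|
  &= |f(D(x)) - f(D(z))| \\
  &\le L \, d\bigl(D(x), D(z)\bigr) \\
  &\le L K \, d(x, z).
\end{align*}
% Hence every x with d(x, z) < \varepsilon / (L K) satisfies
% f(D(x)) > \tau - \varepsilon: the defended output stays within
% \varepsilon of the threshold on a whole ball around z.
```

If the underlying measure assigns positive mass to every open ball, this ball is the positive-measure near-threshold band referred to above.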

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm

Threat Tags

inference_time

Applications

prompt injection defense

llm safety


Similar Papers

FraudShield: Knowledge Graph Empowered Defense for LLMs against Fraud Attacks

Prompt Control-Flow Integrity: A Priority-Aware Runtime Defense Against Prompt Injection in LLM Systems

Invasive Context Engineering to Control Large Language Models

Automating Cloud Security and Forensics Through a Secure-by-Design Generative AI Framework

PlanGuard: Defending Agents against Indirect Prompt Injection via Planning-based Consistency Verification

A Self-Improving Architecture for Dynamic Safety in Large Language Models

Guard Vector: Beyond English LLM Guardrails with Task-Vector Composition and Streaming-Aware Prefix SFT

BlueCodeAgent: A Blue Teaming Agent Enabled by Automated Red Teaming for CodeGen AI