defense 2026

The Mirror Design Pattern: Strict Data Geometry over Model Scale for Prompt Injection Detection

J Alex Corll

0 citations

α

Published on arXiv

2603.11875

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

A 5K-sample Mirror-trained linear SVM achieves 95.97% recall and 92.07% F1 at sub-millisecond latency, versus 44.35% recall and 59.14% F1 for the 22M-parameter Prompt Guard 2 baseline at 49ms median latency

Mirror

Novel technique introduced


Prompt injection defenses are often framed as semantic understanding problems and delegated to increasingly large neural detectors. For the first screening layer, however, the requirements are different: the detector runs on every request and therefore must be fast, deterministic, non-promptable, and auditable. We introduce Mirror, a data-curation design pattern that organizes prompt injection corpora into matched positive and negative cells so that a classifier learns control-plane attack mechanics rather than incidental corpus shortcuts. Using 5,000 strictly curated open-source samples -- the largest corpus supportable under our public-data validity contract -- we define a 32-cell mirror topology, fill 31 of those cells with public data, train a sparse character n-gram linear SVM, compile its weights into a static Rust artifact, and obtain 95.97\% recall and 92.07\% F1 on a 524-case holdout at sub-millisecond latency with no external model runtime dependencies. On the same holdout, our next line of defense, a 22-million-parameter Prompt Guard~2 model reaches 44.35\% recall and 59.14\% F1 at 49\,ms median and 324\,ms p95 latency. Linear models still leave residual semantic ambiguities such as use-versus-mention for later pipeline layers, but within that scope our results show that for L1 prompt injection screening, strict data geometry can matter more than model scale.


Key Contributions

  • Mirror: a data-curation design pattern that organizes prompt injection corpora into matched positive/negative geometric cells so classifiers learn attack mechanics rather than corpus shortcuts
  • A 32-cell mirror topology trained sparse character n-gram linear SVM compiled into a static Rust artifact with no external model runtime dependencies
  • Empirical demonstration that strict data geometry at 5K samples outperforms a 22M-parameter Prompt Guard 2 model (95.97% vs 44.35% recall) at sub-millisecond vs 49ms median latency

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llmtraditional_ml
Threat Tags
inference_timeblack_box
Datasets
Mirror corpus (5,000 curated samples)Mirror holdout (524 cases)Prompt Guard 2 evaluation set
Applications
llm prompt injection screeningllm security guardrails