Published on arXiv

2602.04581

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves state-of-the-art safety detection across 18 benchmarks while reducing false positive rates by up to 40x compared to specialized safety models, with zero training on harmful examples.

T3 (Trust The Typical)

Novel technique introduced


Current approaches to LLM safety fundamentally rely on a brittle cat-and-mouse game of identifying and blocking known threats via guardrails. We argue for a fresh approach: robust safety comes not from enumerating what is harmful, but from deeply understanding what is safe. We introduce Trust The Typical (T3), a framework that operationalizes this principle by treating safety as an out-of-distribution (OOD) detection problem. T3 learns the distribution of acceptable prompts in a semantic space and flags any significant deviation as a potential threat. Unlike prior methods, it requires no training on harmful examples, yet achieves state-of-the-art performance across 18 benchmarks spanning toxicity, hate speech, jailbreaking, multilingual harms, and over-refusal, reducing false positive rates by up to 40x relative to specialized safety models. A single model trained only on safe English text transfers effectively to diverse domains and over 14 languages without retraining. Finally, we demonstrate production readiness by integrating a GPU-optimized version into vLLM, enabling continuous guardrailing during token generation with less than 6% overhead even under dense evaluation intervals on large-scale workloads.
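The abstract frames safety as OOD detection over a learned distribution of safe prompts. A minimal sketch of that idea, assuming a Gaussian density over safe-prompt embeddings scored by Mahalanobis distance (the paper does not specify its density model or encoder; the synthetic embeddings below are stand-ins for a sentence encoder's output):

```python
import numpy as np

class SafeDistributionGuard:
    """Fit a Gaussian to embeddings of *safe* prompts only, then flag
    inputs whose Mahalanobis distance exceeds a quantile threshold.
    This mirrors the T3 premise (no harmful training examples), but the
    Gaussian/Mahalanobis choice is an illustrative assumption."""

    def fit(self, safe_embeddings, quantile=0.99):
        self.mean = safe_embeddings.mean(axis=0)
        cov = np.cov(safe_embeddings, rowvar=False)
        # Small ridge term keeps the covariance invertible.
        self.inv_cov = np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[0]))
        # Threshold set so ~1% of the safe data itself would be flagged.
        self.threshold = np.quantile(self._score(safe_embeddings), quantile)
        return self

    def _score(self, embeddings):
        # Squared Mahalanobis distance: (x - mu)^T Sigma^-1 (x - mu).
        d = embeddings - self.mean
        return np.einsum("ij,jk,ik->i", d, self.inv_cov, d)

    def is_ood(self, embeddings):
        return self._score(embeddings) > self.threshold

# Usage with synthetic stand-in embeddings (a real deployment would
# embed a corpus of vetted safe prompts with a text encoder).
safe_embeddings = np.random.default_rng(0).normal(size=(500, 32))
guard = SafeDistributionGuard().fit(safe_embeddings)
```

Prompts near the safe distribution score low and pass; anything far from it (in this sketch, an extreme embedding) is flagged without the detector ever having seen a harmful example.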


Key Contributions

  • T3 framework that reformulates LLM safety as OOD detection over a learned distribution of safe prompts, requiring no harmful training examples
  • Zero-shot transfer to 14+ languages and diverse domains from a single model trained only on safe English text
  • Production-ready vLLM integration enabling continuous token-generation-time guardrailing with less than 6% overhead
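The third contribution describes guardrailing that runs continuously during token generation. A conceptual sketch of that control flow, not the paper's actual vLLM integration (the function names and streaming interface below are assumptions for illustration): the partial output is re-scored every few tokens so an unsafe continuation can be cut off mid-stream rather than filtered after the fact, and the evaluation interval is the knob behind the reported sub-6% overhead.

```python
def guarded_generate(token_stream, score_fn, threshold, check_interval=8):
    """Consume a token iterator; stop early if the running text scores OOD.

    token_stream:   any iterable of string tokens
    score_fn:       text -> float OOD score (higher = less typical)
    threshold:      scores above this are treated as unsafe
    check_interval: evaluate every N tokens (denser checks cost more)
    """
    emitted = []
    for i, tok in enumerate(token_stream, start=1):
        emitted.append(tok)
        if i % check_interval == 0 and score_fn("".join(emitted)) > threshold:
            return "".join(emitted), True   # flagged mid-generation
    return "".join(emitted), False          # completed without a flag
```

In this sketch `score_fn` would be the fitted safe-distribution scorer applied to the embedding of the partial output; any monotone OOD score works.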

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time
Datasets
AdvBench, HarmBench, ToxiGen, JailbreakBench
Applications
llm safety guardrails, jailbreak detection, toxicity filtering, content moderation