The Laminar Flow Hypothesis: Detecting Jailbreaks via Semantic Turbulence in Large Language Models
Published on arXiv
2512.13741
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Variance of layer-wise cosine velocity distinguishes jailbreak prompts from benign inputs with a 75.4% turbulence increase in RLHF-aligned Qwen2-1.5B (p < 0.001), enabling lightweight real-time detection without auxiliary classifiers.
Semantic Turbulence (Laminar Flow Hypothesis)
Novel technique introduced
As Large Language Models (LLMs) become ubiquitous, the challenge of securing them against adversarial "jailbreaking" attacks has intensified. Current defense strategies often rely on computationally expensive external classifiers or brittle lexical filters, overlooking the intrinsic dynamics of the model's reasoning process. This work introduces the Laminar Flow Hypothesis, which posits that benign inputs induce smooth, gradual transitions in an LLM's high-dimensional latent space, whereas adversarial prompts trigger chaotic, high-variance trajectories, termed Semantic Turbulence, arising from the internal conflict between safety-alignment and instruction-following objectives. This phenomenon is formalized through a novel zero-shot metric: the variance of layer-wise cosine velocity. Experimental evaluation across diverse small language models reveals a striking diagnostic capability. The RLHF-aligned Qwen2-1.5B exhibits a statistically significant 75.4% increase in turbulence under attack (p < 0.001), validating the hypothesis of internal conflict. Conversely, Gemma-2B displays a 22.0% decrease in turbulence, characterizing a distinct, low-entropy "reflex-based" refusal mechanism. These findings demonstrate that Semantic Turbulence serves not only as a lightweight, real-time jailbreak detector but also as a non-invasive diagnostic tool for categorizing the underlying safety architecture of black-box models.
Key Contributions
- Introduces the Laminar Flow Hypothesis and formalizes Semantic Turbulence (variance of layer-wise cosine velocity) as a zero-shot, training-free jailbreak detection metric
- Empirically validates that RLHF-aligned models (Qwen2-1.5B) exhibit a statistically significant 75.4% turbulence increase under jailbreak attacks, while Gemma-2B shows a 22.0% decrease consistent with a "reflex-based" refusal mechanism
- Demonstrates Semantic Turbulence as a non-invasive diagnostic tool for categorizing underlying LLM safety architectures (RLHF conflict-based vs. SFT reflex-based) without access to training data
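The metric described above can be sketched in a few lines. The following is a minimal, hypothetical reconstruction (not the authors' released code): it assumes `layer_states` holds one hidden-state vector per transformer layer (e.g. the final token's representation), defines the per-layer "cosine velocity" as one minus the cosine similarity between consecutive layers, and takes the variance of that velocity across layers as the turbulence score.

```python
import numpy as np

def semantic_turbulence(layer_states: np.ndarray) -> float:
    """Variance of layer-wise cosine velocity (illustrative sketch).

    layer_states: array of shape (num_layers, hidden_dim), e.g. the
    hidden state of the last token at each transformer layer.
    """
    # Cosine similarity between each pair of consecutive layer states.
    a, b = layer_states[:-1], layer_states[1:]
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-12
    )
    # One natural choice of "velocity": the change 1 - cos per layer step.
    velocity = 1.0 - cos
    # Turbulence: how erratically the trajectory moves across depth.
    return float(np.var(velocity))

# A laminar trajectory (same direction at every layer) scores ~0,
# while a chaotic trajectory (random directions) scores higher.
smooth = np.ones((12, 64)) * np.arange(1.0, 13.0)[:, None]
chaotic = np.random.default_rng(0).standard_normal((12, 64))
```

Under the paper's hypothesis, a prompt would be flagged as a likely jailbreak when this score exceeds a threshold calibrated on benign inputs for the specific model, since the absolute scale of the variance differs across architectures.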