defense 2026

Cross-Lingual Jailbreak Detection via Semantic Codebooks

Shirin Alanova 1, Bogdan Minko 2, Sabrina Sadiekh 2, Evgeniy Kokuykin 2

Published on arXiv

2604.25716

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves AUC up to 0.99 on template-based jailbreaks but degrades to 0.60-0.70 under distribution shift to heterogeneous attacks

Semantic Codebook Detection

Novel technique introduced


Safety mechanisms for large language models (LLMs) remain predominantly English-centric, creating systematic vulnerabilities in multilingual deployment. Prior work shows that translating malicious prompts into other languages can substantially increase jailbreak success rates, exposing a structural cross-lingual security gap. We investigate whether such attacks can be mitigated through language-agnostic semantic similarity, without retraining or language-specific adaptation. Our approach compares multilingual query embeddings against a fixed English codebook of jailbreak prompts and operates as a training-free external guardrail for black-box LLMs. We conduct a systematic evaluation across four languages, two translation pipelines, four safety benchmarks, three embedding models, and three target LLMs (Qwen, Llama, GPT-3.5). Our results reveal two distinct regimes of cross-lingual transfer. On curated benchmarks containing canonical jailbreak templates, semantic similarity generalizes reliably across languages, achieving near-perfect separability (AUC up to 0.99) and substantial reductions in absolute attack success rates under strict low-false-positive constraints. However, under distribution shift (on behaviorally diverse and heterogeneous unsafe benchmarks), separability degrades markedly (AUC $\approx$ 0.60-0.70), and recall in the security-critical low-FPR regime drops across all embedding models.
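
The codebook comparison described above can be illustrated with a minimal sketch. It assumes that a query is scored by its maximum cosine similarity to any codebook entry and flagged when that score reaches a threshold; the abstract does not specify the aggregation rule, so this choice is an assumption. The three-dimensional vectors stand in for real multilingual embeddings, and the names `cosine`, `codebook_score`, and `is_flagged` are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def codebook_score(query: Vector, codebook: Sequence[Vector]) -> float:
    """Highest similarity between the query and any codebook entry.

    Max aggregation is an assumption made for this sketch.
    """
    return max(cosine(query, entry) for entry in codebook)


def is_flagged(query: Vector, codebook: Sequence[Vector], threshold: float) -> bool:
    """Flag the query when its codebook score reaches the threshold."""
    return codebook_score(query, codebook) >= threshold


# Toy stand-ins for embeddings of English jailbreak prompts (made-up values).
toy_codebook = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Toy stand-ins for embeddings of non-English queries (made-up values).
toy_attack_query = [0.9, 0.1, 0.0]   # close to the first codebook entry
toy_benign_query = [0.0, 0.1, 0.9]   # far from every codebook entry

print(is_flagged(toy_attack_query, toy_codebook, threshold=0.8))
print(is_flagged(toy_benign_query, toy_codebook, threshold=0.8))
```

In a real deployment the vectors would come from a multilingual embedding model, so that a translated prompt lands near its English counterpart in the codebook; the threshold of 0.8 here is purely illustrative.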


Key Contributions

  • Training-free cross-lingual jailbreak detection using semantic codebooks with multilingual embeddings
  • Systematic evaluation across 4 languages, 4 safety benchmarks, 3 embedding models, and 3 LLMs
  • Identifies performance degradation under distribution shift from template-based to behaviorally diverse attacks
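
The two metrics reported in the evaluation, AUC and recall under a low false-positive-rate constraint, can be computed from detector scores as in the sketch below. It is a generic illustration, not the authors' evaluation code: the scores are made-up numbers, and the function names `auc`, `threshold_at_fpr`, and `recall_at_fpr` are hypothetical. A prompt is treated as flagged when its score is strictly greater than the threshold.

```python
from typing import Sequence


def auc(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Probability that a random attack scores above a random benign prompt.

    Ties count as half a win (the Mann-Whitney formulation of AUC).
    """
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def threshold_at_fpr(neg: Sequence[float], max_fpr: float) -> float:
    """Lowest threshold at which at most max_fpr of benign prompts score above it."""
    ranked = sorted(neg, reverse=True)
    allowed = int(max_fpr * len(ranked))  # benign prompts we may wrongly flag
    if allowed >= len(ranked):
        return float("-inf")
    return ranked[allowed]


def recall_at_fpr(pos: Sequence[float], neg: Sequence[float], max_fpr: float) -> float:
    """Fraction of attacks flagged when the threshold is set from benign scores."""
    threshold = threshold_at_fpr(neg, max_fpr)
    return sum(1 for p in pos if p > threshold) / len(pos)


# Made-up detector scores, for illustration only.
benign_scores = [0.10, 0.20, 0.30, 0.40, 0.50, 0.55, 0.60, 0.62, 0.65, 0.70]
attack_scores = [0.95, 0.90, 0.68, 0.50]

print(auc(attack_scores, benign_scores))
print(recall_at_fpr(attack_scores, benign_scores, max_fpr=0.1))
```

The two numbers can diverge: AUC averages over all thresholds, whereas recall at a fixed low FPR looks only at the strictest operating points, which is why the abstract reports both.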


Details

Domains
nlp, multimodal
Model Types
llm, transformer
Threat Tags
black_box, inference_time
Datasets
curated jailbreak benchmarks, diverse unsafe benchmarks
Applications
llm safety, multilingual content moderation, jailbreak detection