defense 2025

Sentra-Guard: A Multilingual Human-AI Framework for Real-Time Defense Against Adversarial LLM Jailbreaks

Md. Mehedi Hasan , Ziaur Rahman , Rafid Mostafiz , Md. Abir Hossain

0 citations · 38 references · arXiv

α

Published on arXiv

2510.22628

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves 99.96% detection rate (AUC=1.00, F1=1.00) with ASR of only 0.004%, outperforming LlamaGuard-2 (1.3% ASR) and OpenAI Moderation (3.7% ASR).

Sentra-Guard

Novel technique introduced


This paper presents a real-time modular defense system named Sentra-Guard. The system detects and mitigates jailbreak and prompt injection attacks targeting large language models (LLMs). The framework uses a hybrid architecture with FAISS-indexed SBERT embedding representations that capture the semantic meaning of prompts, combined with fine-tuned transformer classifiers, which are machine learning models specialized for distinguishing between benign and adversarial language inputs. It identifies adversarial prompts in both direct and obfuscated attack vectors. A core innovation is the classifier-retriever fusion module, which dynamically computes context-aware risk scores that estimate how likely a prompt is to be adversarial based on its content and context. The framework ensures multilingual resilience with a language-agnostic preprocessing layer. This component automatically translates non-English prompts into English for semantic evaluation, enabling consistent detection across over 100 languages. The system includes a HITL feedback loop, where decisions made by the automated system are reviewed by human experts for continual learning and rapid adaptation under adversarial pressure. Sentra-Guard maintains an evolving dual-labeled knowledge base of benign and malicious prompts, enhancing detection reliability and reducing false positives. Evaluation results show a 99.96% detection rate (AUC = 1.00, F1 = 1.00) and an attack success rate (ASR) of only 0.004%. This outperforms leading baselines such as LlamaGuard-2 (1.3%) and OpenAI Moderation (3.7%). Unlike black-box approaches, Sentra-Guard is transparent, fine-tunable, and compatible with diverse LLM backends. Its modular design supports scalable deployment in both commercial and open-source environments. The system establishes a new state-of-the-art in adversarial LLM defense.


Key Contributions

  • Classifier-retriever fusion module combining FAISS-indexed SBERT embeddings with fine-tuned transformer classifiers to compute context-aware risk scores for adversarial prompts
  • Language-agnostic multilingual preprocessing layer that auto-translates non-English prompts, enabling consistent jailbreak detection across 100+ languages
  • Human-in-the-loop (HITL) feedback loop maintaining an evolving dual-labeled knowledge base for continual adaptation under adversarial pressure

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llmtransformer
Threat Tags
inference_timeblack_box
Datasets
Malicious Instruct benchmark
Applications
llm safety guardrailsjailbreak detectionprompt injection mitigation