defense 2025

Sentra-Guard: A Multilingual Human-AI Framework for Real-Time Defense Against Adversarial LLM Jailbreaks

Md. Mehedi Hasan , Ziaur Rahman , Rafid Mostafiz , Md. Abir Hossain

0 citations · 38 references · arXiv

Published on arXiv

2510.22628

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves 99.96% detection rate (AUC=1.00, F1=1.00) with ASR of only 0.004%, outperforming LlamaGuard-2 (1.3% ASR) and OpenAI Moderation (3.7% ASR).

Sentra-Guard

Novel technique introduced

This paper presents a real-time modular defense system named Sentra-Guard. The system detects and mitigates jailbreak and prompt injection attacks targeting large language models (LLMs). The framework uses a hybrid architecture with FAISS-indexed SBERT embedding representations that capture the semantic meaning of prompts, combined with fine-tuned transformer classifiers, which are machine learning models specialized for distinguishing between benign and adversarial language inputs. It identifies adversarial prompts in both direct and obfuscated attack vectors. A core innovation is the classifier-retriever fusion module, which dynamically computes context-aware risk scores that estimate how likely a prompt is to be adversarial based on its content and context. The framework ensures multilingual resilience with a language-agnostic preprocessing layer. This component automatically translates non-English prompts into English for semantic evaluation, enabling consistent detection across over 100 languages. The system includes a HITL feedback loop, where decisions made by the automated system are reviewed by human experts for continual learning and rapid adaptation under adversarial pressure. Sentra-Guard maintains an evolving dual-labeled knowledge base of benign and malicious prompts, enhancing detection reliability and reducing false positives. Evaluation results show a 99.96% detection rate (AUC = 1.00, F1 = 1.00) and an attack success rate (ASR) of only 0.004%. This outperforms leading baselines such as LlamaGuard-2 (1.3%) and OpenAI Moderation (3.7%). Unlike black-box approaches, Sentra-Guard is transparent, fine-tunable, and compatible with diverse LLM backends. Its modular design supports scalable deployment in both commercial and open-source environments. The system establishes a new state-of-the-art in adversarial LLM defense.

Key Contributions

Classifier-retriever fusion module combining FAISS-indexed SBERT embeddings with fine-tuned transformer classifiers to compute context-aware risk scores for adversarial prompts
Language-agnostic multilingual preprocessing layer that auto-translates non-English prompts, enabling consistent jailbreak detection across 100+ languages
Human-in-the-loop (HITL) feedback loop maintaining an evolving dual-labeled knowledge base for continual adaptation under adversarial pressure

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llmtransformer

Threat Tags

inference_timeblack_box

Datasets

Malicious Instruct benchmark

Applications

llm safety guardrailsjailbreak detectionprompt injection mitigation

Read PDF arXiv DOI

Sentra-Guard: A Multilingual Human-AI Framework for Real-Time Defense Against Adversarial LLM Jailbreaks

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

PISanitizer: Preventing Prompt Injection to Long-Context LLMs via Prompt Sanitization

PIShield: Detecting Prompt Injection Attacks via Intrinsic LLM Features

Securing AI Agents Against Prompt Injection Attacks

Mitigating Indirect Prompt Injection via Instruction-Following Intent Analysis

From static to adaptive: immune memory-based jailbreak detection for large language models

Knowing When Not to Answer: Lightweight KB-Aligned OOD Detection for Safe RAG

Defend LLMs Through Self-Consciousness

Prefix Probing: Lightweight Harmful Content Detection for Large Language Models