defense 2025

A Real-Time, Self-Tuning Moderator Framework for Adversarial Prompt Detection

Ivan Zhang 1,2

0 citations

Published on arXiv

2508.07139

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

RTST adapts to novel jailbreaks in real time using only LLM outputs, outperforming traditional fine-tuning and static classifier defenses while maintaining low computational overhead.

RTST (Real-Time, Self-Tuning Moderator)

Novel technique introduced


Ensuring LLM alignment is critical to information security as AI models become increasingly widespread and integrated in society. Unfortunately, many defenses against adversarial attacks and jailbreaking on LLMs cannot adapt quickly to new attacks, degrade model responses to benign prompts, or introduce significant barriers to scalable implementation. To mitigate these challenges, we introduce a real-time, self-tuning (RTST) moderator framework to defend against adversarial attacks while maintaining a lightweight training footprint. We empirically evaluate its effectiveness using Google's Gemini models against modern, effective jailbreaks. Our results demonstrate the advantages of an adaptive, minimally intrusive framework for jailbreak defense over traditional fine-tuning or classifier models.


Key Contributions

  • RTST (Real-Time, Self-Tuning) moderator framework built on a two-agent Evaluator + Reviewer architecture that adapts to novel jailbreaks in real time from a single prompt
  • A behavior-weight system inspired by neural-network optimization that updates without expensive model retraining, maintaining a lightweight computational footprint
  • Empirical evaluation of RTST against modern jailbreaks (PAIR, AutoDAN, TAP, DarkMind, Policy Puppetry, FlipAttack, etc.) using Google Gemini, showing advantages over static fine-tuning and classifier-based defenses
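
The Evaluator + Reviewer loop and the behavior-weight update named above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the class name `Moderator`, the behavior names, the threshold, the learning rate, and the simple error-driven update rule are all hypothetical, and the per-behavior detection strengths (which an LLM would supply in practice) are passed in as plain numbers.

```python
from dataclasses import dataclass, field


@dataclass
class Moderator:
    # Hypothetical behavior weights; names and values are illustrative only.
    weights: dict[str, float] = field(default_factory=lambda: {
        "role_play_override": 0.5,
        "obfuscated_text": 0.5,
    })
    threshold: float = 0.6      # assumed blocking threshold
    learning_rate: float = 0.5  # assumed step size for weight updates

    def evaluate(self, detected: dict[str, float]) -> float:
        """Evaluator step: weighted sum of detection strengths in [0, 1]."""
        return sum(self.weights.get(b, 0.0) * s for b, s in detected.items())

    def is_blocked(self, detected: dict[str, float]) -> bool:
        """Block the prompt when the weighted score reaches the threshold."""
        return self.evaluate(detected) >= self.threshold

    def review(self, detected: dict[str, float], output_was_harmful: bool) -> None:
        """Reviewer step: nudge weights toward the outcome seen in the LLM output.

        No model is retrained; only the small weight table changes.
        """
        target = 1.0 if output_was_harmful else 0.0
        predicted = min(1.0, self.evaluate(detected))
        error = target - predicted
        for behavior, strength in detected.items():
            updated = self.weights.get(behavior, 0.0) + self.learning_rate * error * strength
            self.weights[behavior] = min(1.0, max(0.0, updated))


moderator = Moderator()
signals = {"obfuscated_text": 1.0}           # one prompt showing one behavior
blocked_before = moderator.is_blocked(signals)
moderator.review(signals, output_was_harmful=True)
blocked_after = moderator.is_blocked(signals)
```

In this sketch a single reviewed prompt is enough to raise the weight of the observed behavior past the threshold, which is one plausible reading of "adapts from a single prompt"; the paper's actual update rule may differ.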


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, black_box
Datasets
JailbreakBench (JBC), LOKI, HarmBench
Applications
llm safety, jailbreak defense, conversational ai moderation