A Real-Time, Self-Tuning Moderator Framework for Adversarial Prompt Detection
Ivan Zhang 1,2
Published on arXiv
2508.07139
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
RTST adapts to novel jailbreaks in real-time using only LLM outputs, outperforming traditional fine-tuning and static classifier defenses while maintaining low computational overhead.
RTST (Real-Time, Self-Tuning Moderator)
Novel technique introduced
Ensuring LLM alignment is critical to information security as AI models become increasingly widespread and integrated in society. Unfortunately, many defenses against adversarial attacks and jailbreaking on LLMs cannot adapt quickly to new attacks, degrade model responses to benign prompts, or introduce significant barriers to scalable implementation. To mitigate these challenges, we introduce a real-time, self-tuning (RTST) moderator framework to defend against adversarial attacks while maintaining a lightweight training footprint. We empirically evaluate its effectiveness using Google's Gemini models against modern, effective jailbreaks. Our results demonstrate the advantages of an adaptive, minimally intrusive framework for jailbreak defense over traditional fine-tuning or classifier models.
Key Contributions
- RTST (Real-Time, Self-Tuning) moderator framework using an Evaluator + Reviewer two-agent architecture that adapts to novel jailbreaks in real-time from a single prompt
- A behavior-weight system inspired by neural-network optimization that updates without expensive model retraining, maintaining a lightweight computational footprint
- Empirical evaluation of RTST against modern jailbreaks (PAIR, AutoDAN, TAP, DarkMind, Policy Puppetry, FlipAttack, etc.) using Google Gemini, showing advantages over static fine-tuning and classifier-based defenses