Defending Large Language Models Against Jailbreak Exploits with Responsible AI Considerations
Ryan Wong, Hosea David Yu Fei Ng, Dhananjai Sharma, Glenn Jun Jie Ng, Kavishvaran Srinivasan
Published on arXiv: 2511.18933
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
The agent-based defense achieves full mitigation of jailbreak attacks on benchmark datasets, with all three strategies substantially reducing attack success rate.
Novel technique introduced: Logit-Based Steering Defense
Large Language Models (LLMs) remain susceptible to jailbreak exploits that bypass safety filters and induce harmful or unethical behavior. This work presents a systematic taxonomy of existing jailbreak defenses across prompt-level, model-level, and training-time interventions, followed by three proposed defense strategies. First, a Prompt-Level Defense Framework detects and neutralizes adversarial inputs through sanitization, paraphrasing, and adaptive system guarding. Second, a Logit-Based Steering Defense reinforces refusal behavior through inference-time vector steering in safety-sensitive layers. Third, a Domain-Specific Agent Defense employs the MetaGPT framework to enforce structured, role-based collaboration and domain adherence. Experiments on benchmark datasets show substantial reductions in attack success rate, achieving full mitigation under the agent-based defense. Overall, this study highlights how jailbreaks pose a significant security threat to LLMs and identifies key intervention points for prevention, while noting that defense strategies often involve trade-offs between safety, performance, and scalability. Code is available at: https://github.com/Kuro0911/CS5446-Project
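To make the prompt-level stage concrete, the following is a minimal sketch of a sanitize, detect, and guard pipeline. It is not the authors' implementation: the pattern list, the system prompt strings, and the function names (`sanitize`, `looks_adversarial`, `guard`) are all hypothetical, and the paraphrasing step mentioned in the abstract is omitted because it would require a model call.

```python
import re
import unicodedata

# Hypothetical phrase patterns; a practical detector would be much broader
# or would use a learned classifier instead of regular expressions.
SUSPICIOUS_PATTERNS = [
    r"ignore (all |any )?(previous|prior) instructions",
    r"pretend (you are|to be)",
    r"without (any )?restrictions",
]

# Zero-width characters sometimes used to hide trigger phrases from filters.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))

# Hypothetical system prompts, chosen only for illustration.
BASE_SYSTEM = "You are a helpful assistant. Follow the safety policy."
STRICT_SUFFIX = " The user message was flagged; refuse unsafe requests."


def sanitize(prompt: str) -> str:
    """Normalize Unicode, drop zero-width characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", prompt)
    text = text.translate(ZERO_WIDTH)
    return re.sub(r"\s+", " ", text).strip()


def looks_adversarial(prompt: str) -> bool:
    """Return True if any suspicious pattern occurs in the prompt."""
    lowered = prompt.lower()
    return any(re.search(p, lowered) for p in SUSPICIOUS_PATTERNS)


def guard(prompt: str) -> dict:
    """Sanitize the prompt and tighten the system prompt when it is flagged."""
    clean = sanitize(prompt)
    flagged = looks_adversarial(clean)
    system = BASE_SYSTEM + (STRICT_SUFFIX if flagged else "")
    return {"system": system, "user": clean, "flagged": flagged}
```

Sanitizing before detection matters in this sketch: a zero-width character inserted into a trigger phrase would otherwise let it slip past the pattern match.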
Key Contributions
- Systematic taxonomy of existing LLM jailbreak defenses across prompt-level, model-level, and training-time interventions
- Logit-Based Steering Defense that reinforces refusal behavior via inference-time vector steering in safety-sensitive layers
- Domain-Specific Agent Defense using MetaGPT's structured role-based collaboration to achieve full attack mitigation on benchmarks
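The steering contribution can be illustrated with a toy example. This summary does not specify how the steering vector is derived or where exactly it is applied, so the sketch below assumes one common recipe: take the difference of mean activations between harmful and harmless prompts as a "refusal direction", then add a scaled copy of it to the hidden state of selected layers at inference time. The function names, the `alpha` strength, and the plain-list vectors are illustrative assumptions; a real implementation would operate on model tensors.

```python
from statistics import fmean


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [fmean(col) for col in zip(*rows)]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from mean harmless to mean harmful activations.

    Assumption: difference-of-means is used to derive the direction.
    """
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = sum(d * d for d in diff) ** 0.5
    if norm == 0:
        raise ValueError("activation means are identical; no direction found")
    return [d / norm for d in diff]


def steer(hidden, direction, alpha):
    """Shift one hidden-state vector along the direction by strength alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def steer_layers(hidden_by_layer, direction, alpha, layers):
    """Apply steering only to the chosen (hypothetically safety-sensitive) layers."""
    return {
        layer: steer(h, direction, alpha) if layer in layers else list(h)
        for layer, h in hidden_by_layer.items()
    }
```

Restricting the shift to a subset of layers mirrors the "safety-sensitive layers" wording above, and `alpha` is where the safety versus performance trade-off noted in the abstract would show up: larger values push harder toward refusal but risk degrading benign outputs.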