defense 2025

Defending Large Language Models Against Jailbreak Exploits with Responsible AI Considerations

Ryan Wong , Hosea David Yu Fei Ng , Dhananjai Sharma , Glenn Jun Jie Ng , Kavishvaran Srinivasan

0 citations · 30 references · arXiv

α

Published on arXiv

2511.18933

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

The agent-based defense achieves full mitigation of jailbreak attacks on benchmark datasets, with all three strategies substantially reducing attack success rate.

Logit-Based Steering Defense

Novel technique introduced


Large Language Models (LLMs) remain susceptible to jailbreak exploits that bypass safety filters and induce harmful or unethical behavior. This work presents a systematic taxonomy of existing jailbreak defenses across prompt-level, model-level, and training-time interventions, followed by three proposed defense strategies. First, a Prompt-Level Defense Framework detects and neutralizes adversarial inputs through sanitization, paraphrasing, and adaptive system guarding. Second, a Logit-Based Steering Defense reinforces refusal behavior through inference-time vector steering in safety-sensitive layers. Third, a Domain-Specific Agent Defense employs the MetaGPT framework to enforce structured, role-based collaboration and domain adherence. Experiments on benchmark datasets show substantial reductions in attack success rate, achieving full mitigation under the agent-based defense. Overall, this study highlights how jailbreaks pose a significant security threat to LLMs and identifies key intervention points for prevention, while noting that defense strategies often involve trade-offs between safety, performance, and scalability. Code is available at: https://github.com/Kuro0911/CS5446-Project


Key Contributions

  • Systematic taxonomy of existing LLM jailbreak defenses across prompt-level, model-level, and training-time interventions
  • Logit-Based Steering Defense that reinforces refusal behavior via inference-time vector steering in safety-sensitive layers
  • Domain-Specific Agent Defense using MetaGPT's structured role-based collaboration to achieve full attack mitigation on benchmarks

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llmtransformer
Threat Tags
inference_timetraining_timedigital
Datasets
AdvBench
Applications
large language modelschatbotsllm safety