defense 2025

Context Misleads LLMs: The Role of Context Filtering in Maintaining Safe Alignment of LLMs

Jinhwa Kim, Ian G. Harris

0 citations

Published on arXiv

2508.10031

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Context Filtering reduces jailbreak Attack Success Rates by up to 88% across six attacks while preserving original LLM helpfulness, achieving state-of-the-art Safety and Helpfulness Product scores.

Context Filtering

Novel technique introduced


Although Large Language Models (LLMs) have advanced significantly in performance, jailbreak attacks pose growing safety and ethical risks. Malicious users often exploit adversarial context to deceive LLMs into responding to harmful queries. In this study, we propose a new defense mechanism, the Context Filtering model: an input pre-processing method that filters out untrustworthy and unreliable context and identifies the primary prompt carrying the real user intent, thereby uncovering concealed malicious intent. Because enhancing the safety of LLMs often compromises their helpfulness and can degrade the experience of benign users, our method aims to improve safety while preserving the LLMs' original performance. We evaluate our model by comparing it with state-of-the-art defense mechanisms against six different jailbreak attacks and by assessing the helpfulness of LLMs under each defense. Our model reduces the Attack Success Rates of jailbreak attacks by up to 88% while maintaining the original LLMs' performance, achieving state-of-the-art Safety and Helpfulness Product results. Notably, our model is a plug-and-play method that can be applied to any LLM, white-box or black-box, to enhance safety without fine-tuning the model itself. We will make our model publicly available for research purposes.


Key Contributions

  • Context Filtering: a plug-and-play input pre-processing defense that strips adversarial context from prompts while preserving the user's primary intent
  • Reduces Attack Success Rate by up to 88% across six different jailbreak attacks without fine-tuning the base LLM
  • Achieves state-of-the-art Safety and Helpfulness Product, minimizing the safety-helpfulness trade-off common in prior defenses
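
The plug-and-play idea in the contributions above can be illustrated with a small wrapper. This is only a sketch under stated assumptions: `toy_filter`, `with_context_filter`, and `echo_llm` are hypothetical names, and the toy filter (keep the last non-empty line) is a stand-in heuristic, not the learned Context Filtering model described in the paper. What it shows is the shape of the defense: the filter runs on the input before the target LLM sees it, and the LLM itself is untouched.

```python
from typing import Callable

# Both the filter and the LLM are treated as plain text-to-text functions,
# which is what makes the defense applicable to black-box models.
PromptFilter = Callable[[str], str]
LLM = Callable[[str], str]


def toy_filter(prompt: str) -> str:
    """Hypothetical stand-in for a context filter.

    Keeps only the last non-empty line, on the assumption that the
    primary request follows any surrounding context. The paper's
    filter is a trained model; this heuristic is illustration only.
    """
    lines = [line.strip() for line in prompt.splitlines() if line.strip()]
    return lines[-1] if lines else ""


def with_context_filter(llm: LLM, context_filter: PromptFilter) -> LLM:
    """Wrap an LLM so every prompt is filtered before the model sees it."""

    def guarded(prompt: str) -> str:
        primary_prompt = context_filter(prompt)
        return llm(primary_prompt)

    return guarded


def echo_llm(prompt: str) -> str:
    """Hypothetical LLM stub that reports the prompt it received."""
    return f"LLM saw: {prompt}"


guarded_llm = with_context_filter(echo_llm, toy_filter)
```

Because the wrapper only needs a text-in, text-out callable, swapping `echo_llm` for a real API client or a local model would not change the filtering step; a realistic filter would replace `toy_filter` in the same slot.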


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, white_box, inference_time
Datasets
AlpacaEval
Applications
llm safety, chatbot safety