defense 2025

Learning to Extract Context for Context-Aware LLM Inference

Minseon Kim, Lucas Caccia, Zhengyan Shi, Matheus Pereira, Marc-Alexandre Côté, Xingdi Yuan, Alessandro Sordoni

Microsoft Research

0 citations · 30 references · arXiv

Published on arXiv

2512.11986

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Reduces harmful responses by 5.6% on SafetyInstruct and improves the harmonic mean of attack success rate and benign compliance by 6.2% on XSTest and WildJailbreak across multiple foundation models.

RL-based Context Generator

Novel technique introduced

User prompts to large language models (LLMs) are often ambiguous or under-specified, and subtle contextual cues shaped by user intentions, prior knowledge, and risk factors strongly influence what constitutes an appropriate response. Misinterpreting intent or risk may lead to unsafe outputs, while overly cautious interpretations can cause unnecessary refusals of benign requests. In this paper, we question the conventional framework in which LLMs generate immediate responses to requests without considering these broader contextual factors. We propose a framework that extracts and leverages such contextual information from the user prompt itself. Specifically, a reinforcement-learning-based context generator, designed in an autoencoder-like fashion, is trained to infer contextual signals grounded in the prompt and use them to guide response generation. This approach is particularly important for safety tasks, where ambiguous requests may bypass safeguards while benign but confusing requests can trigger unnecessary refusals. Experiments show that our method reduces harmful responses by an average of 5.6% on the SafetyInstruct dataset across multiple foundation models and improves the harmonic mean of attack success rate and compliance on benign prompts by 6.2% on XSTest and WildJailbreak. These results demonstrate the effectiveness of context extraction for safer and more reliable LLM inference.
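The two-stage flow described in the abstract (first infer context from the prompt, then condition the response on it) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `TextModel`, `context_aware_answer`, `toy_context_generator`, and `toy_responder`, as well as the prompt template, are hypothetical, and the toy functions stand in for real LLM calls. The reinforcement learning training of the context generator is not shown.

```python
from typing import Callable, Dict

# Hypothetical alias: any function that maps input text to generated text.
TextModel = Callable[[str], str]


def context_aware_answer(
    prompt: str,
    context_generator: TextModel,
    responder: TextModel,
) -> Dict[str, str]:
    """Infer context from the prompt, then answer conditioned on that context."""
    # Stage 1: extract contextual signals (e.g. intent, knowledge, risk).
    context = context_generator(prompt)
    # Stage 2: give the responder both the inferred context and the request.
    # The template below is an assumption made for illustration only.
    augmented_prompt = f"Context: {context}\n\nUser request: {prompt}"
    response = responder(augmented_prompt)
    return {"context": context, "response": response}


def toy_context_generator(prompt: str) -> str:
    # Stand-in for a trained context generator (assumption, not a real model).
    if "process" in prompt.lower():
        return "Likely intent: software administration. Risk: low."
    return "Intent unclear. Risk: unknown."


def toy_responder(augmented_prompt: str) -> str:
    # Stand-in for the response model; it only reports what it was given.
    return f"[response conditioned on {len(augmented_prompt)} characters of input]"


result = context_aware_answer(
    "How do I kill a Python process?",
    toy_context_generator,
    toy_responder,
)
print(result["context"])
print(result["response"])
```

Keeping the context generator and the responder as separate callables mirrors the abstract's description of context extraction as a step that precedes response generation; in practice either slot could be filled by any model wrapper.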

Key Contributions

Autoencoder-style reinforcement learning framework that extracts latent contextual signals (intent, knowledge, risk) from user prompts before response generation
Demonstrates that context-aware inference reduces harmful responses by an average of 5.6% on SafetyInstruct while improving the harmonic mean of attack success rate and benign-prompt compliance by 6.2% on XSTest and WildJailbreak
Addresses the dual failure mode of LLM safety: unsafe outputs from ambiguous jailbreak prompts and unnecessary refusals of benign but confusing requests
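A harmonic mean rewards a model only when both quantities are high at once, which is why it suits the safety versus over-refusal trade-off listed above. The sketch below shows one way such a combined score could be computed. It assumes the safety term enters as `1 - attack_success_rate` so that higher is better for both inputs; the page does not state the exact formulation, and the function names and example numbers are hypothetical, not results from the paper.

```python
def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two non-negative rates; 0.0 if both are zero."""
    if a < 0 or b < 0:
        raise ValueError("rates must be non-negative")
    if a + b == 0:
        return 0.0
    return 2 * a * b / (a + b)


def safety_helpfulness_score(attack_success_rate: float, benign_compliance: float) -> float:
    """Combine safety and helpfulness into a single score.

    Assumption: safety is measured as 1 - attack_success_rate, so that a
    higher value is better for both terms.
    """
    defense_rate = 1.0 - attack_success_rate
    return harmonic_mean(defense_rate, benign_compliance)


# Illustrative inputs only: a model that blocks every attack but refuses
# every benign prompt scores no better than one that blocks nothing.
print(safety_helpfulness_score(0.0, 0.0))
print(safety_helpfulness_score(1.0, 1.0))
print(safety_helpfulness_score(0.5, 0.5))
```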

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm, transformer, rl

Threat Tags

inference_time, black_box

Datasets

SafetyInstruct, XSTest, WildJailbreak

Applications

llm safety, chatbot, content moderation

Similar Papers

Online Learning Defense against Iterative Jailbreak Attacks via Prompt Optimization

SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models

Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay

Safety Alignment of LMs via Non-cooperative Games

PISanitizer: Preventing Prompt Injection to Long-Context LLMs via Prompt Sanitization

SecureCAI: Injection-Resilient LLM Assistants for Cybersecurity Operations

Auto-Tuning Safety Guardrails for Black-Box Large Language Models

Zero-Shot Embedding Drift Detection: A Lightweight Defense Against Prompt Injections in LLMs