
Reasoning Up the Instruction Ladder for Controllable Language Models

Zishuo Zheng 1, Vidhisha Balachandran 2, Chan Young Park 2, Faeze Brahman 3, Sachin Kumar 1

1 citation · 56 references · arXiv


Published on arXiv · 2511.04694

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

RL fine-tuning on VerIH achieves ~20% improvement on the IHEval conflict setup and up to a 20% reduction in jailbreak and prompt injection attack success rates, generalizing beyond the training distribution.

VerIH

Novel technique introduced


As large language model (LLM) based systems take on high-stakes roles in real-world decision-making, they must reconcile competing instructions from multiple sources (e.g., model developers, users, and tools) within a single prompt context. Thus, enforcing an instruction hierarchy (IH) in LLMs, where higher-level directives override lower-priority requests, is critical to their reliability and controllability. In this work, we reframe instruction hierarchy resolution as a reasoning task. Specifically, the model must first "think" about the relationship between a given user prompt and higher-priority (system) instructions before generating a response. To enable this capability via training, we construct VerIH, an instruction hierarchy dataset of constraint-following tasks with verifiable answers. This dataset comprises ~7K aligned and conflicting system-user instructions. We show that lightweight reinforcement learning with VerIH effectively transfers general reasoning capabilities of models to instruction prioritization. Our fine-tuned models achieve consistent improvements on instruction following and instruction hierarchy benchmarks, achieving roughly a 20% improvement on the IHEval conflict setup. This reasoning ability also generalizes to safety-critical settings beyond the training distribution. By treating safety issues as resolving conflicts between adversarial user inputs and predefined higher-priority policies, our trained model enhances robustness against jailbreak and prompt injection attacks, providing up to a 20% reduction in attack success rate (ASR). These results demonstrate that reasoning over instruction hierarchies provides a practical path to reliable LLMs, where updates to system prompts yield controllable and robust changes in model behavior.
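The "verifiable answers" idea above can be made concrete with a toy reward function. The sketch below is a hypothetical illustration in the spirit of VerIH, not the paper's actual implementation: under a system-user conflict, the reward credits only outputs that satisfy the higher-priority (system) constraint; in the aligned case, both constraints must hold. The helper names and the specific string constraints are assumptions for this example.

```python
# Hypothetical verifiable instruction-hierarchy reward (illustrative only).
# A "verifiable" constraint is one we can check programmatically on the
# model's output string, e.g. casing or length constraints.

def reward(output: str, system_check, user_check, conflicting: bool) -> float:
    """Return 1.0 iff the output respects the instruction hierarchy."""
    if conflicting:
        # Under conflict, the higher-priority (system) constraint must win.
        return 1.0 if system_check(output) else 0.0
    # Aligned case: both constraints should be satisfied simultaneously.
    return 1.0 if system_check(output) and user_check(output) else 0.0

# Example verifiable constraints (simple string properties):
is_lower = lambda s: s == s.lower()   # system: "respond in lowercase"
is_upper = lambda s: s == s.upper()   # user:   "respond in ALL CAPS"

print(reward("ok, here is the answer.", is_lower, is_upper, conflicting=True))  # 1.0
print(reward("OK, HERE IS THE ANSWER.", is_lower, is_upper, conflicting=True))  # 0.0
```

Because the reward is a deterministic check on the output, it can drive lightweight RL fine-tuning without a learned reward model.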


Key Contributions

  • Reframes instruction hierarchy resolution as a reasoning task, where the model must 'think' about the relationship between user prompts and system-level instructions before responding.
  • Constructs VerIH, a ~7K-sample dataset of aligned and conflicting system–user instructions with verifiable answers used for lightweight RL fine-tuning.
  • Demonstrates that reasoning over instruction hierarchies generalizes to out-of-distribution safety settings, achieving ~20% improvement on IHEval conflict benchmark and up to 20% reduction in jailbreak/prompt injection ASR.
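To make the second contribution concrete, here is a minimal sketch of what an aligned vs. conflicting system-user pair might look like in a VerIH-style dataset. The constraint texts, field names, and `make_pair` helper are assumptions for illustration; the actual dataset construction is described in the paper.

```python
# Illustrative VerIH-style examples: the same base task paired once with a
# compatible user constraint (aligned) and once with a contradictory one
# (conflicting). All specifics here are hypothetical.
from dataclasses import dataclass

@dataclass
class IHExample:
    system: str        # higher-priority instruction with a checkable constraint
    user: str          # lower-priority user instruction
    conflicting: bool  # does the user instruction contradict the system one?

def make_pair(task: str, sys_constraint: str, user_constraint: str,
              conflicting: bool) -> IHExample:
    return IHExample(
        system=f"{task} Constraint: {sys_constraint}.",
        user=f"Please do this, but {user_constraint}.",
        conflicting=conflicting,
    )

aligned = make_pair("Summarize the article.", "use at most 50 words",
                    "keep it brief", conflicting=False)
conflict = make_pair("Summarize the article.", "use at most 50 words",
                     "write at least 500 words", conflicting=True)
```

A word-count constraint like "at most 50 words" is verifiable by a simple check on the output, which is what makes RL with an automatic reward signal feasible here.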

🛡️ Threat Analysis


Details

Domains
NLP
Model Types
llm, transformer
Threat Tags
inference_time, training_time
Datasets
VerIH, IHEval, jailbreak benchmarks
Applications
conversational AI, LLM-based agents, safety-critical LLM deployment