Is Reasoning Capability Enough for Safety in Long-Context Language Models?
Yu Fu 1, Haz Sameen Shahgir 1, Huanli Gong 2,3, Zhipeng Wei 3, N. Benjamin Erichson 3,4, Yue Dong 1
1 University of California, Riverside
2 University of California, Berkeley
3 International Computer Science Institute, Berkeley
Published on arXiv
arXiv:2602.08874
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Stronger reasoning capability does not improve LLM safety under compositional attacks; safety alignment degrades with context length, but increasing inference-time compute reduces attack success rate by over 50 percentage points.
Compositional Reasoning Attacks
Novel technique introduced
Large language models (LLMs) increasingly combine long-context processing with advanced reasoning, enabling them to retrieve and synthesize information distributed across tens of thousands of tokens. A natural hypothesis is that stronger reasoning capability should improve safety by helping models recognize harmful intent even when it is not stated explicitly. We test this hypothesis in long-context settings where harmful intent is implicit and must be inferred through reasoning, and find that it does not hold. We introduce compositional reasoning attacks, a new threat model in which a harmful query is decomposed into incomplete fragments scattered throughout a long context. The model is then prompted with a neutral reasoning query that induces retrieval and synthesis, causing the harmful intent to emerge only after composition. Evaluating 14 frontier LLMs on contexts up to 64k tokens, we uncover three findings: (1) models with stronger general reasoning capability are not more robust to compositional reasoning attacks, often assembling the intent yet failing to refuse; (2) safety alignment consistently degrades as context length increases; and (3) inference-time reasoning effort is a key mitigating factor: increasing inference-time compute reduces attack success by over 50 percentage points on the GPT-oss-120b model. Together, these results suggest that safety does not automatically scale with reasoning capability, especially under long-context inference.
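The construction described above can be sketched in a few lines: decompose a query into fragments, scatter them among benign filler text, and close with a neutral query that induces retrieval and synthesis. This is a minimal illustration with a harmless toy payload; the fragmenting strategy, filler text, and function name are assumptions for exposition, not the paper's exact attack pipeline.

```python
import random


def build_compositional_context(fragments, filler_paragraphs, neutral_query, seed=0):
    """Sketch of a compositional-reasoning-attack context (illustrative only).

    Each fragment alone looks benign; the composed intent would only
    emerge after a model retrieves and synthesizes all fragments.
    """
    rng = random.Random(seed)
    context = list(filler_paragraphs)
    # Scatter each fragment at a random position among the filler text.
    for frag in fragments:
        pos = rng.randint(0, len(context))
        context.insert(pos, frag)
    # End with a neutral-sounding query that induces retrieval + synthesis.
    context.append(neutral_query)
    return "\n\n".join(context)


# Toy, harmless fragments standing in for a decomposed query.
fragments = [
    "Note A: the first step is described in the appendix.",
    "Note B: the second step appears in section 3.",
]
filler = [f"Benign filler paragraph {i}." for i in range(5)]
prompt = build_compositional_context(
    fragments, filler, "Combine every 'Note' above into one instruction."
)
```

In the paper's setting, the filler would be tens of thousands of tokens of long-context material rather than a handful of short paragraphs, so composition requires genuine long-range retrieval.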
Key Contributions
- Introduces compositional reasoning attacks: a single-pass, long-context threat model where a harmful query is decomposed into incomplete fragments scattered throughout the context, with harmful intent emerging only after model synthesis
- Empirically demonstrates across 14 frontier LLMs that stronger general reasoning capability does not improve robustness to this attack, and that safety alignment consistently degrades as context length increases up to 64k tokens
- Identifies inference-time compute as a key mitigating factor, with increased reasoning effort reducing attack success by over 50 percentage points on GPT-oss-120b