Monotonicity as an Architectural Bias for Robust Language Models
Patrick Cooper, Alireza Nadali, Ashutosh Trivedi, Alvaro Velasquez
Published on arXiv
2602.02686
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Selectively enforcing monotonicity in feed-forward sublayers reduces adversarial attack success rates from ~69% to ~19% while preserving summarization quality on pretrained Transformer models
Monotone Language Models
Novel technique introduced
Large language models (LLMs) are known to exhibit brittle behavior under adversarial prompts and jailbreak attacks, even after extensive alignment and fine-tuning. This fragility reflects a broader challenge for modern neural language models: small, carefully structured perturbations in high-dimensional input spaces can induce large and unpredictable changes in internal semantic representations and outputs. We investigate monotonicity as an architectural inductive bias for improving the robustness of Transformer-based language models. Monotonicity constrains semantic transformations so that strengthening information, evidence, or constraints cannot cause regressions in the corresponding internal representations. Such order-preserving behavior has long been exploited in control and safety-critical systems to simplify reasoning and improve robustness, but it has traditionally been viewed as incompatible with the expressivity required by neural language models. We show that this trade-off is not inherent. By enforcing monotonicity selectively in the feed-forward sublayers of sequence-to-sequence Transformers -- while leaving attention mechanisms unconstrained -- we obtain monotone language models that preserve the performance of their pretrained counterparts. This architectural separation allows negation, contradiction, and contextual interactions to be introduced explicitly through attention, while ensuring that subsequent semantic refinement is order-preserving. Empirically, monotonicity substantially improves robustness: adversarial attack success rates drop from approximately 69% to 19%, while standard summarization performance degrades only marginally.
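To make the order-preserving property concrete, here is a minimal sketch of a monotone feed-forward block. The `abs()` reparameterization of the weights and the specific two-layer shape are illustrative assumptions, not the paper's exact construction; the key idea is that non-negative weights composed with a monotone non-decreasing activation (ReLU) guarantee that elementwise-increasing the input can never decrease any output coordinate.

```python
import random

def relu(v):
    # ReLU is monotone non-decreasing, so it preserves input ordering.
    return [max(0.0, x) for x in v]

def monotone_linear(W, b, x):
    # abs() forces every effective weight to be non-negative, so increasing
    # any input coordinate can only increase (or leave unchanged) each output.
    # (One common construction; the paper's parameterization is an assumption.)
    return [sum(abs(w) * xi for w, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def monotone_ffn(params, x):
    # Two-layer FFN sublayer: monotone linear -> ReLU -> monotone linear.
    # Composing monotone maps yields a monotone map.
    W1, b1, W2, b2 = params
    return monotone_linear(W2, b2, relu(monotone_linear(W1, b1, x)))

# Numerical spot-check: if x2 >= x1 elementwise, then f(x2) >= f(x1) elementwise.
random.seed(0)
d, h = 4, 8
W1 = [[random.gauss(0, 1) for _ in range(d)] for _ in range(h)]
b1 = [random.gauss(0, 1) for _ in range(h)]
W2 = [[random.gauss(0, 1) for _ in range(h)] for _ in range(d)]
b2 = [random.gauss(0, 1) for _ in range(d)]
x1 = [random.gauss(0, 1) for _ in range(d)]
x2 = [xi + random.random() for xi in x1]  # "strengthen" every coordinate
y1 = monotone_ffn((W1, b1, W2, b2), x1)
y2 = monotone_ffn((W1, b1, W2, b2), x2)
assert all(a >= b for a, b in zip(y2, y1))  # order preserved
```

In the paper's architecture this constraint applies only to the feed-forward sublayers; attention remains unconstrained, which is where sign-flipping interactions such as negation are handled.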
Key Contributions
- Proposes selective monotonicity enforcement in Transformer feed-forward sublayers as an architectural inductive bias for robustness, while leaving attention mechanisms unconstrained to preserve expressivity
- Demonstrates that monotone language models reduce adversarial attack success rates from ~69% to ~19% with only marginal degradation in standard summarization performance
- Shows that the monotone/attention separation allows negation and contextual reasoning through attention while guaranteeing order-preserving semantic refinement in FFN layers
🛡️ Threat Analysis
The paper's core technical contribution is a defense against adversarial input perturbations — constraining feed-forward sublayers to be order-preserving reduces adversarial attack success rates from ~69% to ~19%, directly addressing inference-time input manipulation attacks on language models.