defense 2026

Monotonicity as an Architectural Bias for Robust Language Models

Patrick Cooper, Alireza Nadali, Ashutosh Trivedi, Alvaro Velasquez

0 citations · 31 references · arXiv (Cornell University)


Published on arXiv · 2602.02686

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Selectively enforcing monotonicity in feed-forward sublayers reduces adversarial attack success rates from ~69% to ~19% while preserving summarization quality on pretrained Transformer models

Monotone Language Models

Novel technique introduced


Large language models (LLMs) are known to exhibit brittle behavior under adversarial prompts and jailbreak attacks, even after extensive alignment and fine-tuning. This fragility reflects a broader challenge of modern neural language models: small, carefully structured perturbations in high-dimensional input spaces can induce large and unpredictable changes in internal semantic representations and output. We investigate monotonicity as an architectural inductive bias for improving the robustness of Transformer-based language models. Monotonicity constrains semantic transformations so that strengthening information, evidence, or constraints cannot lead to regressions in the corresponding internal representations. Such order-preserving behavior has long been exploited in control and safety-critical systems to simplify reasoning and improve robustness, but has traditionally been viewed as incompatible with the expressivity required by neural language models. We show that this trade-off is not inherent. By enforcing monotonicity selectively in the feed-forward sublayers of sequence-to-sequence Transformers -- while leaving attention mechanisms unconstrained -- we obtain monotone language models that preserve the performance of their pretrained counterparts. This architectural separation allows negation, contradiction, and contextual interactions to be introduced explicitly through attention, while ensuring that subsequent semantic refinement is order-preserving. Empirically, monotonicity substantially improves robustness: adversarial attack success rates drop from approximately 69% to 19%, while standard summarization performance degrades only marginally.
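The abstract's central construction can be illustrated concretely. A minimal sketch, assuming one standard way to obtain a monotone feed-forward block (not necessarily the authors' exact parameterization): compose non-negative weight matrices with a monotone activation such as ReLU. Any map built this way is order-preserving by construction.

```python
import numpy as np

def monotone_ffn(x, W1, b1, W2, b2):
    """Two-layer feed-forward block made monotone by construction.

    Non-negative weights (enforced here via np.abs) composed with a
    monotone activation (ReLU) yield an order-preserving map:
    x <= y elementwise implies monotone_ffn(x) <= monotone_ffn(y).
    """
    h = np.maximum(0.0, x @ np.abs(W1) + b1)  # ReLU over non-negative weights
    return h @ np.abs(W2) + b2

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 8)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 4)), rng.normal(size=4)

x = rng.normal(size=4)
y = x + np.abs(rng.normal(size=4))  # y dominates x elementwise

# Order is preserved through the block
assert np.all(monotone_ffn(x, W1, b1, W2, b2) <= monotone_ffn(y, W1, b1, W2, b2))
```

In the paper's architecture this constraint is applied only to the feed-forward sublayers; attention remains unconstrained, so order-reversing operations like negation are still expressible there.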


Key Contributions

  • Proposes selective monotonicity enforcement in Transformer feed-forward sublayers as an architectural inductive bias for robustness, while leaving attention mechanisms unconstrained to preserve expressivity
  • Demonstrates that monotone language models reduce adversarial attack success rates from ~69% to ~19% with only marginal degradation in standard summarization performance
  • Shows that the monotone/attention separation allows negation and contextual reasoning through attention while guaranteeing order-preserving semantic refinement in FFN layers
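The order-preservation guarantee claimed above is also easy to probe empirically. A small sketch of such a check (a hypothetical test harness, not from the paper): sample elementwise-ordered input pairs and search for violations of f(x) <= f(y).

```python
import numpy as np

def violates_monotonicity(f, dim, trials=200, seed=1):
    """Empirically search for a monotonicity violation.

    Samples pairs x <= y (elementwise) and reports whether
    f(x) <= f(y) fails on any coordinate in any trial.
    """
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x = rng.normal(size=dim)
        y = x + rng.uniform(0.0, 1.0, size=dim)  # y >= x elementwise
        if np.any(f(x) > f(y) + 1e-9):
            return True
    return False

W = np.array([[1.0, -2.0], [0.5, 0.3]])
mono = lambda v: v @ np.abs(W)       # non-negative weights: monotone
signed = lambda v: np.tanh(v @ W)    # signed weights: typically not
```

Here `mono` never triggers a violation, while `signed` generally does: a negative weight lets an input increase drive an output decrease, which is exactly the behavior the monotone FFN constraint rules out.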

🛡️ Threat Analysis

Input Manipulation Attack

The paper's core technical contribution is a defense against adversarial input perturbations — constraining feed-forward sublayers to be order-preserving reduces adversarial attack success rates from ~69% to ~19%, directly addressing inference-time input manipulation attacks on language models.


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
white_box · black_box · inference_time · digital
Applications
language model safety · jailbreak defense · text summarization