Monotonicity as an Architectural Bias for Robust Language Models
Patrick Cooper, Alireza Nadali, Ashutosh Trivedi, Alvaro Velasquez
Published on arXiv
2602.02686
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Selectively enforcing monotonicity in feed-forward sublayers reduces adversarial attack success rates from ~69% to ~19% while preserving summarization quality on pretrained Transformer models
Monotone Language Models
Novel technique introduced
Large language models (LLMs) are known to exhibit brittle behavior under adversarial prompts and jailbreak attacks, even after extensive alignment and fine-tuning. This fragility reflects a broader challenge for modern neural language models: small, carefully structured perturbations in high-dimensional input spaces can induce large and unpredictable changes in internal semantic representations and outputs. We investigate monotonicity as an architectural inductive bias for improving the robustness of Transformer-based language models. Monotonicity constrains semantic transformations so that strengthening information, evidence, or constraints cannot cause regressions in the corresponding internal representations. Such order-preserving behavior has long been exploited in control and safety-critical systems to simplify reasoning and improve robustness, but it has traditionally been viewed as incompatible with the expressivity required by neural language models. We show that this trade-off is not inherent. By enforcing monotonicity selectively in the feed-forward sublayers of sequence-to-sequence Transformers -- while leaving attention mechanisms unconstrained -- we obtain monotone language models that preserve the performance of their pretrained counterparts. This architectural separation allows negation, contradiction, and contextual interactions to be introduced explicitly through attention, while ensuring that subsequent semantic refinement is order-preserving. Empirically, monotonicity substantially improves robustness: adversarial attack success rates drop from approximately 69% to 19%, while standard summarization performance degrades only marginally.
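To make the order-preserving property concrete, here is a minimal sketch of a monotone feed-forward block. The `abs()` reparameterization of the weights and the specific two-layer shape are illustrative assumptions, not the paper's exact construction; the key idea is that non-negative weights composed with a monotone non-decreasing activation (ReLU) guarantee that elementwise-increasing the input can never decrease any output coordinate.

```python
import random

def relu(v):
    # ReLU is monotone non-decreasing, so it preserves input ordering.
    return [max(0.0, x) for x in v]

def monotone_linear(W, b, x):
    # abs() forces every effective weight to be non-negative, so increasing
    # any input coordinate can only increase (or leave unchanged) each output.
    # (One common construction; the paper's parameterization is an assumption.)
    return [sum(abs(w) * xi for w, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def monotone_ffn(params, x):
    # Two-layer FFN sublayer: monotone linear -> ReLU -> monotone linear.
    # Composing monotone maps yields a monotone map.
    W1, b1, W2, b2 = params
    return monotone_linear(W2, b2, relu(monotone_linear(W1, b1, x)))

# Numerical spot-check: if x2 >= x1 elementwise, then f(x2) >= f(x1) elementwise.
random.seed(0)
d, h = 4, 8
W1 = [[random.gauss(0, 1) for _ in range(d)] for _ in range(h)]
b1 = [random.gauss(0, 1) for _ in range(h)]
W2 = [[random.gauss(0, 1) for _ in range(h)] for _ in range(d)]
b2 = [random.gauss(0, 1) for _ in range(d)]
x1 = [random.gauss(0, 1) for _ in range(d)]
x2 = [xi + random.random() for xi in x1]  # "strengthen" every coordinate
y1 = monotone_ffn((W1, b1, W2, b2), x1)
y2 = monotone_ffn((W1, b1, W2, b2), x2)
assert all(a >= b for a, b in zip(y2, y1))  # order preserved
```

In the paper's architecture this constraint applies only to the feed-forward sublayers; attention remains unconstrained, which is where sign-flipping interactions such as negation are handled.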
Key Contributions
- Proposes selective monotonicity enforcement in Transformer feed-forward sublayers as an architectural inductive bias for robustness, while leaving attention mechanisms unconstrained to preserve expressivity
- Demonstrates that monotone language models reduce adversarial attack success rates from ~69% to ~19% with only marginal degradation in standard summarization performance
- Shows that the monotone/attention separation allows negation and contextual reasoning through attention while guaranteeing order-preserving semantic refinement in FFN layers
🛡️ Threat Analysis
The paper's core technical contribution is a defense against adversarial input perturbations — constraining feed-forward sublayers to be order-preserving reduces adversarial attack success rates from ~69% to ~19%, directly addressing inference-time input manipulation attacks on language models.