defense 2026

WARP: Guaranteed Inner-Layer Repair of NLP Transformers

Hsin-Ling Hsu, Min-Yu Chen, Nai-Chia Chen, Yan-Ru Chen, Yi-Ling Chang, Fang Yu



Published on arXiv: 2604.00938

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

Achieves 100% repair accuracy and 100% remain accuracy while outperforming gradient-based baselines on attack generalization by up to 18.8 percentage points

WARP

Novel technique introduced


Transformer-based NLP models remain vulnerable to adversarial perturbations, yet existing repair methods face a fundamental trade-off: gradient-based approaches offer flexibility but lack verifiability and often overfit; methods that do provide repair guarantees are restricted to the final layer or small networks, significantly limiting the parameter search space available for repair. We present WARP (Weight-Adjusted Repair with Provability), a constraint-based repair framework that extends repair beyond the last layer of Transformer models. WARP formulates repair as a convex quadratic program derived from a first-order linearization of the logit gap, enabling tractable optimization over a high-dimensional parameter space. Under the condition that the first-order approximation holds, this formulation induces three per-sample guarantees: (i) a positive margin constraint ensuring correct classification on repaired inputs, (ii) preservation constraints over a designated remain set, and (iii) a certified robustness radius derived from Lipschitz continuity. To ensure feasibility across varying model architectures, we introduce a sensitivity-based preprocessing step that conditions the optimization landscape accordingly. We further show that the iterative optimization procedure converges to solutions satisfying all repair constraints under mild assumptions. Empirical evaluation on encoder-only Transformers with varying layer architectures validates that these guarantees hold in practice while improving robustness to adversarial inputs. Our results demonstrate that guaranteed, generalizable Transformer repair is achievable through principled constraint-based optimization.
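Under the stated first-order approximation, the repair step described above can be sketched as a quadratic program of roughly the following shape (the symbols Δθ, g_i, h_j, and ε are hypothetical notation for illustration, not the paper's own):

```latex
% Hypothetical sketch of the repair QP. \Delta\theta: adjustment to the
% selected inner-layer weights; g_i: logit gap (true-class logit minus
% best runner-up) on repair sample i; h_j: logit gap on remain sample j.
\min_{\Delta\theta}\ \|\Delta\theta\|_2^2
\quad\text{s.t.}\quad
\underbrace{g_i(\theta_0) + \nabla_\theta g_i(\theta_0)^{\top}\Delta\theta \ \ge\ \epsilon}_{\text{positive margin on repair set}},
\qquad
\underbrace{h_j(\theta_0) + \nabla_\theta h_j(\theta_0)^{\top}\Delta\theta \ \ge\ \epsilon}_{\text{preservation on remain set}}
```

The objective keeps the weight change small while the linearized constraints enforce the per-sample guarantees; with an L-Lipschitz logit gap, a post-repair margin m then yields a certified radius on the order of m / L around each repaired input.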


Key Contributions

  • WARP framework that extends provable repair beyond the final layer to inner layers of Transformer models via convex quadratic programming
  • Three per-sample guarantees: positive margin on repaired inputs, preservation of remain set accuracy, and certified robustness radius via Lipschitz continuity
  • Gap Sensitivity Norm (GSN) preprocessing step to ensure QP feasibility across different model architectures
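The third guarantee, a certified robustness radius from Lipschitz continuity, can be illustrated with a minimal sketch. This is not the paper's implementation; `certified_radius` is a hypothetical helper assuming the logit gap is L-Lipschitz in the input, so no perturbation smaller than margin / L can flip the prediction:

```python
import numpy as np

def certified_radius(logits: np.ndarray, true_label: int,
                     lipschitz_const: float) -> float:
    """Lower-bound the perturbation size needed to change the prediction.

    Assumes the logit gap (true-class logit minus best runner-up) is
    lipschitz_const-Lipschitz in the input norm being certified against.
    """
    # Margin between the true-class logit and the strongest competitor.
    margin = logits[true_label] - np.max(np.delete(logits, true_label))
    # A non-positive margin means the input is already misclassified,
    # so no radius can be certified.
    return max(float(margin), 0.0) / lipschitz_const

# Example: margin of 2.0 with Lipschitz constant 4.0 certifies radius 0.5.
r = certified_radius(np.array([3.0, 1.0, 0.5]), true_label=0,
                     lipschitz_const=4.0)
```

Any input within distance `r` of the original is then guaranteed to receive the same label, which is the per-sample robustness certificate the repair constraints aim to enlarge.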

🛡️ Threat Analysis

Input Manipulation Attack

Defends against adversarial perturbations on NLP inputs that cause misclassification. The paper explicitly addresses adversarial examples (TextFooler, TextBugger) and repairs models to be robust against such attacks, providing certified robustness radius guarantees.


Details

Domains
nlp
Model Types
transformer
Threat Tags
inference_time, digital
Applications
text classification, nlp