defense 2026

Learning Better Certified Models from Empirically-Robust Teachers

Alessandro De Palma 1,2

0 citations · 77 references · arXiv (Cornell University)


Published on arXiv · 2602.02626

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

Distillation from adversarially-trained teachers consistently improves state-of-the-art certified robustness for ReLU networks on robust computer vision benchmarks

Feature-space distillation for certified training

Novel technique introduced


Adversarial training attains strong empirical robustness to specific adversarial attacks by training on concrete adversarial perturbations, but it produces neural networks that are not amenable to strong robustness certificates through neural network verification. On the other hand, earlier certified training schemes train directly on bounds from network relaxations to obtain models that are certifiably robust, but these display sub-par standard performance. Recent work has shown that state-of-the-art trade-offs between certified robustness and standard performance can be obtained through a family of losses combining adversarial outputs and neural network bounds. Nevertheless, unlike empirical robustness, verifiability still comes at a significant cost in standard performance. In this work, we propose to leverage empirically-robust teachers to improve the performance of certifiably-robust models through knowledge distillation. Using a versatile feature-space distillation objective, we show that distillation from adversarially-trained teachers consistently improves on the state of the art in certified training for ReLU networks across a series of robust computer vision benchmarks.


Key Contributions

  • Proposes leveraging empirically-robust (adversarially-trained) teacher networks to improve certifiably-robust student models via knowledge distillation
  • Introduces a versatile feature-space distillation objective compatible with certified training losses for ReLU networks
  • Demonstrates consistent state-of-the-art improvements in the certified robustness vs. standard accuracy trade-off across multiple computer vision benchmarks
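The contributions above combine a certified-training loss with a feature-space distillation term that pulls student features toward those of an adversarially-trained teacher. A minimal sketch of such a combined objective, using mean-squared error between feature maps (the exact objective, weighting coefficient `alpha`, and feature shapes here are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical penultimate-layer features for a batch of 4 inputs:
# teacher comes from an adversarially-trained network, student from
# the certifiably-trained network being optimized (shapes assumed).
teacher_feats = rng.normal(size=(4, 16))
student_feats = rng.normal(size=(4, 16))

def feature_distillation_loss(f_student, f_teacher):
    """MSE between student and teacher feature maps — one common
    feature-space distillation objective (an assumption here)."""
    return float(np.mean((f_student - f_teacher) ** 2))

def total_loss(certified_loss, f_student, f_teacher, alpha=0.5):
    """Certified-training loss plus the distillation term, weighted
    by a hypothetical trade-off coefficient alpha."""
    return certified_loss + alpha * feature_distillation_loss(f_student, f_teacher)

# Example: combine a (stand-in) certified loss value with distillation.
loss = total_loss(1.2, student_feats, teacher_feats)
```

In practice the distillation term would be added to whichever certified loss the training scheme already uses; only the gradient through the student features is needed, since the teacher is frozen.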

🛡️ Threat Analysis

Input Manipulation Attack

Directly targets defense against adversarial input manipulation: it improves certified training, in which formal verification bounds guarantee robustness to adversarial perturbations, by distilling knowledge from adversarially-trained teachers.
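The "formal verification bounds" mentioned above are typically computed with techniques such as interval bound propagation (IBP): intervals around the input are pushed through each layer, and a prediction is certified if the true logit's lower bound exceeds every other logit's upper bound over the whole perturbation ball. A toy sketch for a 2-layer ReLU network (the random weights and epsilon are stand-ins, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 2-layer ReLU network: 4 inputs -> 8 hidden units -> 2 classes.
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(2, 8)), np.zeros(2)

def ibp_linear(lo, hi, W, b):
    """Propagate an interval [lo, hi] through an affine layer.
    Splitting W into positive/negative parts yields sound bounds."""
    Wp, Wn = np.maximum(W, 0.0), np.minimum(W, 0.0)
    return Wp @ lo + Wn @ hi + b, Wp @ hi + Wn @ lo + b

def certify(x, eps, true_class=0):
    """Return True if the prediction is certifiably robust on the
    L-infinity ball of radius eps around x (under IBP bounds)."""
    lo, hi = x - eps, x + eps
    lo, hi = ibp_linear(lo, hi, W1, b1)
    lo, hi = np.maximum(lo, 0.0), np.maximum(hi, 0.0)  # ReLU is monotone
    lo, hi = ibp_linear(lo, hi, W2, b2)
    others = [hi[j] for j in range(len(hi)) if j != true_class]
    return bool(lo[true_class] > max(others))

x = rng.normal(size=4)
certified = certify(x, eps=0.01)
```

Certified training schemes backpropagate through bounds like these; the paper's contribution is to pair such a loss with distillation from an empirically-robust teacher rather than to change the bound computation itself.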


Details

Domains
vision
Model Types
cnn
Threat Tags
white_box · digital · inference_time
Applications
image classification