Defense · 2025

Abstract Gradient Training: A Unified Certification Framework for Data Poisoning, Unlearning, and Differential Privacy

Philip Sosnin 1,2, Matthew Wicker 1,2, Josh Collyer 2, Calvin Tsay 1,2

2 citations · 1 influential · arXiv


Published on arXiv · 2511.09400

Data Poisoning Attack

OWASP ML Top 10 — ML02

Key Finding

AGT provides provable parameter-space bounds over training runs, yielding formal certificates of robustness against adversarial data poisoning, certified data removal for unlearning, and differential privacy guarantees within a single framework.

Abstract Gradient Training (AGT)

Novel technique introduced


Abstract

The impact of inference-time data perturbation (e.g., adversarial attacks) has been extensively studied in machine learning, leading to well-established certification techniques for adversarial robustness. In contrast, certifying models against training data perturbations remains a relatively under-explored area. These perturbations can arise in three critical contexts: adversarial data poisoning, where an adversary manipulates training samples to corrupt model performance; machine unlearning, which requires certifying model behavior under the removal of specific training data; and differential privacy, where guarantees must be given with respect to substituting individual data points. This work introduces Abstract Gradient Training (AGT), a unified framework for certifying robustness of a given model and training procedure to training data perturbations, including bounded perturbations, the removal of data points, and the addition of new samples. By bounding the reachable set of parameters, i.e., establishing provable parameter-space bounds, AGT provides a formal approach to analyzing the behavior of models trained via first-order optimization methods.
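To make the core idea concrete, the following is a minimal sketch (not the paper's implementation) of how a parameter interval can be propagated through one SGD step on a linear least-squares model, using standard midpoint-radius interval arithmetic. The function names, the l-infinity poisoning budget `eps`, and the restriction to a linear model are all illustrative assumptions; the paper's AGT framework handles general first-order training with tighter bounding techniques.

```python
import numpy as np

def interval_matmul(A_lo, A_hi, B_lo, B_hi):
    """Sound enclosure of {A @ B : A_lo <= A <= A_hi, B_lo <= B <= B_hi}
    via midpoint-radius interval arithmetic (illustrative helper)."""
    Am, Ar = (A_lo + A_hi) / 2, (A_hi - A_lo) / 2
    Bm, Br = (B_lo + B_hi) / 2, (B_hi - B_lo) / 2
    Cm = Am @ Bm
    Cr = np.abs(Am) @ Br + Ar @ np.abs(Bm) + Ar @ Br
    return Cm - Cr, Cm + Cr

def agt_sgd_step(theta_lo, theta_hi, X, y, lr, eps):
    """One interval-certified SGD step on 0.5 * ||X @ theta - y||^2 / n.
    An adversary may perturb each training feature by at most `eps`
    (an l-inf poisoning model); theta is tracked as a box [lo, hi]."""
    X_lo, X_hi = X - eps, X + eps
    # residual interval: r = X @ theta - y
    r_lo, r_hi = interval_matmul(X_lo, X_hi, theta_lo, theta_hi)
    r_lo, r_hi = r_lo - y, r_hi - y
    # gradient interval: g = X.T @ r / n
    g_lo, g_hi = interval_matmul(X_lo.T, X_hi.T, r_lo, r_hi)
    n = X.shape[0]
    g_lo, g_hi = g_lo / n, g_hi / n
    # descent maps the interval monotonically: subtract the opposite bound
    return theta_lo - lr * g_hi, theta_hi - lr * g_lo
```

Iterating this step over a training run yields a box that provably contains every parameter vector reachable under the poisoning budget, which is the shape of guarantee the abstract describes.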


Key Contributions

  • Introduces Abstract Gradient Training (AGT), a unified framework that certifies model robustness to training data perturbations by bounding the reachable set of model parameters (provable parameter-space bounds)
  • Unifies three distinct training-time perturbation scenarios — adversarial data poisoning, machine unlearning, and differential privacy — under a single formal certification approach
  • Applies formal verification techniques from adversarial robustness certification (IBP, CROWN) to the training process itself via first-order optimization

🛡️ Threat Analysis

Data Poisoning Attack

The paper explicitly addresses adversarial data poisoning — where an adversary manipulates training samples to corrupt model performance — and provides provable certification bounds against it. This is the primary security threat in the paper's threat model; the unlearning and DP components are framed as additional certification contexts within the same unified framework, not as independent adversarial threats.


Details

Domains
vision
Model Types
cnn, traditional_ml
Threat Tags
training_time, white_box
Applications
image classification, general supervised learning