Mathematical Foundations of Poisoning Attacks on Linear Regression over Cumulative Distribution Functions
Atsuki Sato 1, Martin Aumüller 2, Yusuke Matsui 1
Published on arXiv
2603.00537
Data Poisoning Attack
OWASP ML Top 10 — ML02
Key Finding
The greedy multi-point poisoning attack achieves at least 93.3% of optimal MSE degradation, while the new Heuristic Seg+E method achieves at least 99.8% of optimal across 3,000 test cases.
Seg+E (Segmented Exact/Heuristic Poisoning)
Novel technique introduced
Learned indexes are a class of index data structures that enable fast search by approximating the cumulative distribution function (CDF) using machine learning models (Kraska et al., SIGMOD'18). However, recent studies have shown that learned indexes are vulnerable to poisoning attacks, where injecting a small number of poison keys into the training data can significantly degrade model accuracy and reduce index performance (Kornaropoulos et al., SIGMOD'22). In this work, we provide a rigorous theoretical analysis of poisoning attacks targeting linear regression models over CDFs, one of the most basic regression models and a core component in many learned indexes. Our main contributions are as follows: (i) We present a theoretical proof characterizing the optimal single-point poisoning attack and show that the existing method yields the optimal attack. (ii) We show that in multi-point attacks, the existing greedy approach is not always optimal, and we rigorously derive the key properties that an optimal attack should satisfy. (iii) We propose a method to compute an upper bound of the multi-point poisoning attack's impact and empirically demonstrate that the loss under the greedy approach is often close to this bound. Our study deepens the theoretical understanding of attack strategies against linear regression models on CDFs and provides a foundation for the theoretical evaluation of attacks and defenses on learned indexes.
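To make the attack setting concrete, the sketch below fits an ordinary-least-squares line to the empirical CDF of a key set (rank as a function of key) and brute-forces the single poison key, from a candidate grid, that maximizes the refit model's MSE. This is a minimal illustration of the single-point attack problem; the function names, the candidate-grid search, and the plain OLS fit are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def fit_cdf_regression(keys):
    """OLS fit of rank ~ key over the sorted keys (a linear CDF model)."""
    keys = np.sort(np.asarray(keys, dtype=float))
    ranks = np.arange(len(keys), dtype=float)  # position of each key
    a, b = np.polyfit(keys, ranks, 1)          # slope, intercept
    return a, b

def mse_on(keys, a, b):
    """MSE of the linear model against the true ranks of the sorted keys."""
    keys = np.sort(np.asarray(keys, dtype=float))
    ranks = np.arange(len(keys), dtype=float)
    return float(np.mean((a * keys + b - ranks) ** 2))

def best_single_poison(legit_keys, candidates):
    """Brute-force single-point attack: pick the candidate poison key that
    maximizes the MSE of the regression refit on legit + poison data."""
    legit_keys = np.asarray(legit_keys, dtype=float)
    best_key, best_loss = None, -1.0
    for p in candidates:
        poisoned = np.append(legit_keys, p)
        a, b = fit_cdf_regression(poisoned)
        loss = mse_on(poisoned, a, b)
        if loss > best_loss:
            best_key, best_loss = p, loss
    return best_key, best_loss
```

On a uniformly spaced key set the unpoisoned fit is essentially exact (MSE near zero), so even one well-placed outlier key raises the loss sharply, which is why small injections can degrade learned-index performance.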
Key Contributions
- Theoretical proof that the existing single-point poisoning method is optimal for linear regression over CDFs
- Rigorous derivation showing greedy multi-point attacks are not always optimal, with characterization of properties an optimal multi-point attack must satisfy
- New exact (Seg+E) and heuristic attack algorithms that achieve optimal or near-optimal (≥99.8% of optimal) multi-point attack performance, along with a computable upper bound on maximum poisoning impact
🛡️ Threat Analysis
The paper directly analyzes data poisoning attacks on training data for linear regression models used in learned indexes, characterizes optimal single-point and multi-point poison key injection strategies, and derives theoretical upper bounds on poisoning impact — a core ML02 contribution.