CoLA: A Choice Leakage Attack Framework to Expose Privacy Risks in Subset Training

Qi Li 1,2, Cheng-Long Wang 1, Yinzhi Cao 3, Di Wang 1

Published on arXiv

2604.12342

Membership Inference Attack

OWASP ML Top 10 — ML04

Key Finding

Subset training leaks both training membership (TM-MIA) and selection participation (SP-MIA), extending privacy risks across the ML data-model supply chain

CoLA

Novel technique introduced


Training models on a carefully chosen portion of data rather than the full dataset is now a standard preprocessing step in modern ML. From vision coreset selection to large-scale filtering in language models, it enables scalability with minimal utility loss. A common intuition is that training on fewer samples should also reduce privacy risks. In this paper, we challenge this assumption. We show that subset training is not privacy-free: the very choices of which data are included or excluded can introduce new privacy surfaces and leak more sensitive information. Such information can be captured by adversaries either through side-channel metadata from the subset selection process or via the outputs of the target model. To systematically study this phenomenon, we propose CoLA (Choice Leakage Attack), a unified framework for analyzing privacy leakage in subset selection. In CoLA, depending on the adversary's knowledge of the side-channel information, we define two practical attack scenarios: Subset-aware Side-channel Attacks and Black-box Attacks. Under both scenarios, we investigate two privacy surfaces unique to subset training: (1) Training-membership MIA (TM-MIA), which concerns only the privacy of training data membership, and (2) Selection-participation MIA (SP-MIA), which concerns the privacy of all samples that participated in the subset selection process. Notably, SP-MIA enlarges the notion of membership from model training to the entire data-model supply chain. Experiments on vision and language models show that existing threat models underestimate subset-training privacy risks: the expanded privacy surface leaks both training and selection membership, extending risks from individual models to the broader ML ecosystem.


Key Contributions

  • Introduces SP-MIA (Selection-participation MIA), expanding membership inference from training data to all data involved in subset selection process
  • Proposes CoLA framework with two attack scenarios: subset-aware side-channel attacks and black-box attacks
  • Demonstrates that subset training increases privacy risk compared to full-dataset training, challenging common assumptions

🛡️ Threat Analysis

Membership Inference Attack

The primary contribution is two novel membership inference attack variants (TM-MIA and SP-MIA) that determine whether specific data points were used in training or participated in the subset selection process.
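To make the membership-inference decision concrete, the sketch below shows a generic loss-threshold membership test, a common MIA baseline. This is an illustration under our own assumptions, not the CoLA method: the intuition is simply that samples a model was trained on tend to incur lower loss, so thresholding per-sample loss separates "member" from "non-member". All losses and the threshold `tau` here are synthetic stand-ins.

```python
# Generic loss-threshold membership test (a standard MIA baseline,
# NOT the paper's CoLA attack). Members of the training subset tend
# to have lower loss under the target model than held-out samples.

def threshold_mia(losses, tau):
    """Predict membership: True ('member') when per-sample loss is below tau."""
    return [loss < tau for loss in losses]

# Synthetic per-sample losses: trained-on samples (low loss) vs. held-out ones.
member_losses = [0.05, 0.12, 0.08]   # samples in the training subset
nonmember_losses = [1.4, 0.9, 2.1]   # samples never used in training

tau = 0.5  # threshold, in practice calibrated e.g. via shadow models
preds = threshold_mia(member_losses + nonmember_losses, tau)
print(preds)  # members flagged True, non-members False
```

SP-MIA broadens this question: instead of asking whether a sample was in the final training subset, it asks whether the sample ever entered the selection pipeline at all, so the same decision machinery is applied to a larger candidate pool.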


Details

Domains
vision, nlp
Model Types
cnn, transformer
Threat Tags
black_box, training_time, inference_time
Datasets
CIFAR-10, CIFAR-100, ImageNet
Applications
coreset selection, data filtering, subset training