
Inducing Uncertainty on Open-Weight Models for Test-Time Privacy in Image Recognition

Muhammad H. Ashiq 1, Peter Triantafillou 2, Hung Yun Tseng 1, Grigoris G. Chrysos 1



Published on arXiv: 2509.11625

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Achieves >3× stronger output uncertainty on protected instances than the pretraining baseline, with only marginal accuracy drops on unprotected instances across image recognition benchmarks

Test-Time Privacy (TTP) with Pareto Optimal Uncertainty Induction

Novel technique introduced


A key concern for AI safety remains understudied in the machine learning (ML) literature: how can we ensure users of ML models do not leverage predictions on incorrect personal data to harm others? This is particularly pertinent given the rise of open-weight models, where simply masking model outputs does not suffice to prevent adversaries from recovering harmful predictions. To address this threat, which we call *test-time privacy*, we induce maximal uncertainty on protected instances while preserving accuracy on all other instances. Our proposed algorithm uses a Pareto optimal objective that explicitly balances test-time privacy against utility. We also provide a certifiable approximation algorithm which achieves $(\varepsilon, \delta)$ guarantees without convexity assumptions. We then prove a tight bound that characterizes the privacy-utility tradeoff that our algorithms incur. Empirically, our method obtains at least $3\times$ stronger uncertainty than pretraining with marginal drops in accuracy on various image recognition benchmarks. Altogether, this framework provides a tool to guarantee additional protection to end users.
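The balance the abstract describes can be sketched as a scalarized training loss: standard cross-entropy on retained (unprotected) data plus a term that pushes the model's output distribution on protected instances toward uniform. This is a minimal NumPy sketch for illustration only; the function names, the KL-to-uniform penalty, and the tradeoff weight `lam` are assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    # Mean negative log-likelihood of the true labels (utility term).
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def kl_to_uniform(logits):
    # KL(p || uniform) = log K - H(p); zero iff p is exactly uniform,
    # so minimizing it induces maximal output uncertainty.
    p = softmax(logits)
    num_classes = logits.shape[-1]
    entropy = -np.sum(p * np.log(p + 1e-12), axis=-1)
    return np.mean(np.log(num_classes) - entropy)

def ttp_objective(retain_logits, retain_labels, protect_logits, lam=0.5):
    # Hypothetical scalarization of a Pareto-style privacy-utility tradeoff:
    # keep accuracy on retained data, flatten outputs on protected data.
    return (cross_entropy(retain_logits, retain_labels)
            + lam * kl_to_uniform(protect_logits))
```

Sweeping `lam` traces out different points on the privacy-utility tradeoff; the paper's actual objective and certified approximation may differ substantially from this sketch.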


Key Contributions

  • Introduces 'test-time privacy' (TTP) — a novel threat model where adversaries leverage open-weight model predictions on incorrect/corrupted personal data to cause individual harm
  • Proposes a Pareto-optimal fine-tuning algorithm that induces maximal output uncertainty on protected instances while preserving accuracy on all other instances
  • Provides (ε, δ)-certified approximation algorithms without convexity assumptions and a tight theoretical bound characterizing the privacy-utility tradeoff
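The ">3× stronger uncertainty" claim suggests comparing predictive entropy on protected instances before and after fine-tuning. The sketch below shows one plausible way to compute such a ratio; the function names and the entropy-ratio metric are assumptions for illustration, not necessarily the paper's evaluation protocol.

```python
import numpy as np

def predictive_entropy(logits):
    # Shannon entropy of the softmax distribution per instance.
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -np.sum(p * np.log(p + 1e-12), axis=-1)

def uncertainty_gain(finetuned_logits, baseline_logits):
    # Ratio of mean predictive entropy on protected instances;
    # a value above 3 would correspond to ">3x stronger uncertainty".
    return (predictive_entropy(finetuned_logits).mean()
            / predictive_entropy(baseline_logits).mean())
```

A confident baseline (sharply peaked logits) has near-zero entropy, so a fine-tuned model whose protected outputs are near-uniform yields a large gain.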

🛡️ Threat Analysis

Output Integrity Attack

The paper's core concern is the integrity of model outputs on protected personal instances: adversaries use confident predictions from open-weight models to make harmful decisions about individuals. The defense ensures outputs on protected data are maximally uncertain (uniformly uninformative), directly targeting the actionability of model predictions. This is an output integrity concern, though the fit is imperfect, as ML09 is more commonly associated with content provenance; no other OWASP category better captures this inference-time output manipulation defense.


Details

Domains
vision
Model Types
cnn, transformer
Threat Tags
white_box, inference_time
Datasets
CIFAR-100, HAM10000/ISIC
Applications
image recognition, medical imaging, skin disease classification