Inducing Uncertainty on Open-Weight Models for Test-Time Privacy in Image Recognition
Muhammad H. Ashiq 1, Peter Triantafillou 2, Hung Yun Tseng 1, Grigoris G. Chrysos 1
Published on arXiv
2509.11625
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Achieves more than 3× stronger output uncertainty on protected instances than the pretraining baseline, with only marginal accuracy drops on unprotected instances across image recognition benchmarks
Test-Time Privacy (TTP) with Pareto Optimal Uncertainty Induction
Novel technique introduced
A key concern for AI safety remains understudied in the machine learning (ML) literature: how can we ensure users of ML models do not leverage predictions on incorrect personal data to harm others? This is particularly pertinent given the rise of open-weight models, where simply masking model outputs does not suffice to prevent adversaries from recovering harmful predictions. To address this threat, which we call *test-time privacy*, we induce maximal uncertainty on protected instances while preserving accuracy on all other instances. Our proposed algorithm uses a Pareto optimal objective that explicitly balances test-time privacy against utility. We also provide a certifiable approximation algorithm which achieves $(\varepsilon, \delta)$ guarantees without convexity assumptions. We then prove a tight bound that characterizes the privacy-utility tradeoff that our algorithms incur. Empirically, our method obtains at least $3\times$ stronger uncertainty than pretraining with marginal drops in accuracy on various image recognition benchmarks. Altogether, this framework provides a tool to guarantee additional protection to end users.
Key Contributions
- Introduces 'test-time privacy' (TTP) — a novel threat model where adversaries leverage open-weight model predictions on incorrect/corrupted personal data to cause individual harm
- Proposes a Pareto-optimal fine-tuning algorithm that induces maximal output uncertainty on protected instances while preserving accuracy on all other instances
- Provides (ε, δ)-certified approximation algorithms without convexity assumptions and a tight theoretical bound characterizing the privacy-utility tradeoff
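The contributions above center on an objective that trades off utility on unprotected data against maximal output uncertainty on protected instances. A minimal sketch of such a combined loss is shown below; the function name, the λ weight, and the choice of KL-to-uniform as the uncertainty term are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ttp_loss(logits_unprotected, labels, logits_protected, lam=1.0):
    """Hypothetical sketch of a test-time-privacy objective:
    cross-entropy (utility) on unprotected data plus a penalty
    (weighted by lam) pushing protected-instance outputs toward
    the uniform distribution, i.e. maximal uncertainty."""
    # Utility term: cross-entropy on unprotected instances
    p_u = softmax(logits_unprotected)
    n = p_u.shape[0]
    ce = -np.mean(np.log(p_u[np.arange(n), labels] + 1e-12))

    # Privacy term: KL(p || uniform) = log K - H(p),
    # which is zero exactly when outputs are uniform
    p_p = softmax(logits_protected)
    k = p_p.shape[1]
    kl = np.mean(np.sum(p_p * (np.log(p_p + 1e-12) + np.log(k)), axis=1))

    return ce + lam * kl
```

Sweeping λ traces out the privacy-utility frontier: larger λ forces stronger uncertainty on protected instances at a potential cost in accuracy elsewhere, matching the Pareto-style tradeoff the paper's bound characterizes.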
🛡️ Threat Analysis
The paper's core concern is the integrity of model outputs on protected personal instances: adversaries use confident predictions from open-weight models to make harmful decisions about individuals. The defense ensures outputs on protected data are maximally uncertain (uniformly uninformative), directly targeting the actionability of model predictions. This is an output integrity concern, though the fit is imperfect, as ML09 is more commonly associated with content provenance; still, no other OWASP category better captures this inference-time defense against misuse of model outputs.