Inducing Uncertainty on Open-Weight Models for Test-Time Privacy in Image Recognition
Muhammad H. Ashiq 1, Peter Triantafillou 2, Hung Yun Tseng 1, Grigoris G. Chrysos 1
Published on arXiv
2509.11625
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Achieves more than 3× stronger output uncertainty on protected instances than the pretraining baseline, with only marginal accuracy drops on unprotected instances across image recognition benchmarks
Test-Time Privacy (TTP) with Pareto Optimal Uncertainty Induction
Novel technique introduced
A key concern for AI safety remains understudied in the machine learning (ML) literature: how can we ensure users of ML models do not leverage predictions on incorrect personal data to harm others? This is particularly pertinent given the rise of open-weight models, where simply masking model outputs does not suffice to prevent adversaries from recovering harmful predictions. To address this threat, which we call *test-time privacy*, we induce maximal uncertainty on protected instances while preserving accuracy on all other instances. Our proposed algorithm uses a Pareto optimal objective that explicitly balances test-time privacy against utility. We also provide a certifiable approximation algorithm which achieves $(\varepsilon, \delta)$ guarantees without convexity assumptions. We then prove a tight bound that characterizes the privacy-utility tradeoff that our algorithms incur. Empirically, our method obtains at least $3\times$ stronger uncertainty than pretraining with marginal drops in accuracy on various image recognition benchmarks. Altogether, this framework provides a tool to guarantee additional protection to end users.
Key Contributions
- Introduces 'test-time privacy' (TTP) — a novel threat model where adversaries leverage open-weight model predictions on incorrect/corrupted personal data to cause individual harm
- Proposes a Pareto-optimal fine-tuning algorithm that induces maximal output uncertainty on protected instances while preserving accuracy on all other instances
- Provides (ε, δ)-certified approximation algorithms without convexity assumptions and a tight theoretical bound characterizing the privacy-utility tradeoff
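The contributions above center on an objective that trades off utility on unprotected data against maximal output uncertainty on protected instances. A minimal sketch of such a combined loss is shown below; the function name, the λ weight, and the choice of KL-to-uniform as the uncertainty term are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ttp_loss(logits_unprotected, labels, logits_protected, lam=1.0):
    """Hypothetical sketch of a test-time-privacy objective:
    cross-entropy (utility) on unprotected data plus a penalty
    (weighted by lam) pushing protected-instance outputs toward
    the uniform distribution, i.e. maximal uncertainty."""
    # Utility term: cross-entropy on unprotected instances
    p_u = softmax(logits_unprotected)
    n = p_u.shape[0]
    ce = -np.mean(np.log(p_u[np.arange(n), labels] + 1e-12))

    # Privacy term: KL(p || uniform) = log K - H(p),
    # which is zero exactly when outputs are uniform
    p_p = softmax(logits_protected)
    k = p_p.shape[1]
    kl = np.mean(np.sum(p_p * (np.log(p_p + 1e-12) + np.log(k)), axis=1))

    return ce + lam * kl
```

Sweeping λ traces out the privacy-utility frontier: larger λ forces stronger uncertainty on protected instances at a potential cost in accuracy elsewhere, matching the Pareto-style tradeoff the paper's bound characterizes.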
🛡️ Threat Analysis
The paper's core concern is the integrity of model outputs on protected personal instances: adversaries use confident predictions from open-weight models to make harmful decisions about individuals. The defense ensures outputs on protected data are maximally uncertain (uniformly uninformative), directly targeting the actionability of model predictions. This is an output integrity concern, though the fit is imperfect, as ML09 is more commonly associated with content provenance; still, no other OWASP category better captures this inference-time defense against misuse of model outputs.