defense 2025

ObCLIP: Oblivious CLoud-Device Hybrid Image Generation with Privacy Preservation

Haoqi Wu 1, Wei Dai 1, Ming Xu 2, Li Wang 1, Qiang Yan 1

0 citations · 47 references · arXiv


Published on arXiv

2510.04153

Model Inversion Attack

OWASP ML Top 10 — ML03

Key Finding

ObCLIP provides rigorous prompt privacy against embedding inversion attacks while achieving comparable image quality to cloud-only models with only slightly increased server cost

ObCLIP

Novel technique introduced


Diffusion models have gained significant popularity due to their remarkable image generation capabilities, albeit at the cost of intensive computation. Meanwhile, despite their widespread deployment in inference services such as Midjourney, concerns have arisen about the potential leakage of sensitive information in uploaded user prompts. Existing solutions either lack rigorous privacy guarantees or fail to strike an effective balance between utility and efficiency. To bridge this gap, we propose ObCLIP, a plug-and-play safeguard that enables oblivious cloud-device hybrid generation. By oblivious, we mean that each input prompt is transformed into a set of semantically similar candidate prompts that differ only in sensitive attributes (e.g., gender, ethnicity). The cloud server processes all candidate prompts without knowing which one is real, thus preventing prompt leakage. To mitigate server cost, only a small portion of the denoising steps is performed on the large cloud model. The intermediate latents are then sent back to the client, which selects the target latent and completes the remaining denoising using a small device model. Additionally, we analyze and incorporate several cache-based accelerations that leverage temporal and batch redundancy, effectively reducing computation cost with minimal utility degradation. Extensive experiments across multiple datasets demonstrate that ObCLIP provides rigorous privacy and utility comparable to cloud models, with only slightly increased server cost.
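The cloud-device split described in the abstract can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: `denoise_step`, `cloud_partial_denoise`, `device_finish`, and `hybrid_generate` are hypothetical names, the "denoiser" is a stand-in that nudges a small vector toward a prompt-dependent target, and the step counts are arbitrary assumptions.

```python
import random


def denoise_step(latent, prompt, strength):
    # Toy stand-in for one denoising step (assumption, not a real diffusion model):
    # move each latent value toward a target derived deterministically from the prompt.
    rng = random.Random(prompt)
    target = [rng.uniform(-1.0, 1.0) for _ in latent]
    return [x + strength * (t - x) for x, t in zip(latent, target)]


def cloud_partial_denoise(candidates, init_latent, cloud_steps):
    # Server side: run the first few steps for EVERY candidate prompt.
    # The server receives no indication of which candidate is the real one.
    latents = []
    for prompt in candidates:
        z = list(init_latent)
        for _ in range(cloud_steps):
            z = denoise_step(z, prompt, strength=0.3)  # "large" model stand-in
        latents.append(z)
    return latents


def device_finish(latent, prompt, device_steps):
    # Client side: complete the remaining steps with a "small" model stand-in.
    z = list(latent)
    for _ in range(device_steps):
        z = denoise_step(z, prompt, strength=0.1)
    return z


def hybrid_generate(candidates, real_index, total_steps=20, cloud_steps=5, seed=0):
    rng = random.Random(seed)
    init_latent = [rng.gauss(0.0, 1.0) for _ in range(4)]
    cloud_latents = cloud_partial_denoise(candidates, init_latent, cloud_steps)
    # Selection happens only on the client; real_index never leaves the device.
    chosen = cloud_latents[real_index]
    return device_finish(chosen, candidates[real_index], total_steps - cloud_steps)
```

In this sketch the server cost grows with the number of candidates times `cloud_steps`, which is why the abstract keeps the cloud portion of the schedule small.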


Key Contributions

  • Oblivious transformation: each user prompt is expanded into semantically similar candidate prompts differing only in sensitive attributes (e.g., gender, ethnicity), so the cloud server cannot identify the real prompt
  • Hybrid cloud-device generation pipeline where the cloud performs only partial denoising steps, and the client selects the correct latent and completes denoising on-device with a small model
  • Cache-based acceleration exploiting temporal and batch redundancy to reduce server overhead from processing multiple candidate prompts
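As a rough illustration of the temporal-redundancy idea in the last contribution, the sketch below reuses an expensive intermediate feature across adjacent denoising steps and counts how many expensive evaluations remain. The summary does not specify the caching schedule, so the fixed `refresh_interval` and the toy `deep_block` / `shallow_block` functions are assumptions; batch redundancy across candidates is not modelled here.

```python
def deep_block(x):
    # Toy stand-in for the expensive deep layers of a denoiser (assumption).
    return 0.5 * x


def shallow_block(x, deep_feature):
    # Toy stand-in for the cheap layers that consume the (possibly cached) feature.
    return x - 0.1 * deep_feature


def run_with_temporal_cache(num_steps, refresh_interval):
    """Run a toy denoising loop, recomputing the deep feature only every
    `refresh_interval` steps and reusing the cached value in between."""
    expensive_calls = 0
    cached_feature = None
    x = 1.0
    for step in range(num_steps):
        if cached_feature is None or step % refresh_interval == 0:
            cached_feature = deep_block(x)  # refresh the cache
            expensive_calls += 1
        x = shallow_block(x, cached_feature)  # cheap update every step
    return x, expensive_calls
```

With 20 steps and a refresh every 4 steps, the deep block is evaluated 5 times instead of 20, at the price of a small drift in the final value relative to the uncached loop.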

🛡️ Threat Analysis

Model Inversion Attack

The paper explicitly defends against embedding inversion attacks (citing text-embedding-inversion-23 and embedding-attack-sp20) where a cloud server could recover the original sensitive text prompt from CLIP embeddings. ObCLIP's oblivious transformation is designed to prevent this reconstruction by ensuring the server never receives the true embedding in isolation. Embedding inversion is a listed ML03 threat vector.
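A minimal sketch of why the server never sees the true prompt in isolation, assuming a simple template-based expansion (the paper's actual transformation may differ): the candidate list is built in a canonical order from the attribute values alone, so the request sent to the cloud is identical no matter which candidate is real. `expand_prompt`, `make_oblivious_request`, and the example template and attribute values are hypothetical.

```python
import itertools


def expand_prompt(template, attribute_values):
    # Enumerate every combination of sensitive attribute values in a fixed,
    # sorted order so the output does not depend on the user's real choice.
    names = sorted(attribute_values)
    value_lists = [sorted(attribute_values[name]) for name in names]
    candidates = []
    for combo in itertools.product(*value_lists):
        candidates.append(template.format(**dict(zip(names, combo))))
    return candidates


def make_oblivious_request(template, attribute_values, real_attributes):
    # Returns (what is sent to the cloud, what stays on the device).
    candidates = expand_prompt(template, attribute_values)
    real_prompt = template.format(**real_attributes)
    real_index = candidates.index(real_prompt)  # kept client-side only
    return candidates, real_index


# Illustrative values only.
template = "a portrait of a {age} {gender} doctor"
values = {"gender": ["man", "woman"], "age": ["young", "elderly"]}
sent_a, idx_a = make_oblivious_request(template, values, {"gender": "man", "age": "young"})
sent_b, idx_b = make_oblivious_request(template, values, {"gender": "woman", "age": "elderly"})
```

Here `sent_a` and `sent_b` are the same list even though the private indices differ, so the embeddings the server computes carry no signal about which sensitive attribute values the user actually chose.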


Details

Domains
vision, generative
Model Types
diffusion, transformer
Threat Tags
black_box, inference_time
Applications
text-to-image generation, cloud inference services