PROMPTMINER: Black-Box Prompt Stealing against Text-to-Image Generative Models via Reinforcement Learning and Fuzz Optimization
Mingzhe Li, Renhao Zhang, Zhiyang Wen, Siqi Pan, Bruno Castro da Silva, Juan Zhai, Shiqing Ma
Published on arXiv
arXiv:2511.22119
Model Theft
OWASP ML Top 10 — ML05
Key Finding
PROMPTMINER achieves CLIP similarity up to 0.958 and outperforms the strongest baseline by 7.5% on in-the-wild images, with no white-box access to the generative model.
PROMPTMINER
Novel technique introduced
Text-to-image (T2I) generative models such as Stable Diffusion and FLUX can synthesize realistic, high-quality images directly from textual prompts. The resulting image quality depends critically on well-crafted prompts that specify both subjects and stylistic modifiers, which have become valuable digital assets. However, the rising value and ubiquity of high-quality prompts expose them to security and intellectual-property risks. One key threat is the prompt stealing attack, i.e., the task of recovering the textual prompt that generated a given image. Prompt stealing enables unauthorized extraction and reuse of carefully engineered prompts, yet it can also support beneficial applications such as data attribution, model provenance analysis, and watermarking validation.

Existing approaches often assume white-box gradient access, require large-scale labeled datasets for supervised training, or rely solely on captioning without explicit optimization, limiting their practicality and adaptability. To address these challenges, we propose PROMPTMINER, a black-box prompt stealing framework that decouples the task into two phases: (1) a reinforcement learning-based optimization phase to reconstruct the primary subject, and (2) a fuzzing-driven search phase to recover stylistic modifiers.

Experiments across multiple datasets and diffusion backbones demonstrate that PROMPTMINER achieves superior results, with CLIP similarity up to 0.958 and SBERT textual alignment up to 0.751, surpassing all baselines. Even when applied to in-the-wild images with unknown generators, it outperforms the strongest baseline by 7.5% in CLIP similarity, demonstrating better generalization. Finally, PROMPTMINER maintains strong performance under defensive perturbations, highlighting remarkable robustness. Code: https://github.com/aaFrostnova/PromptMiner
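The fuzzing-driven search phase described in the abstract can be pictured as a greedy mutate-and-keep loop over a pool of candidate stylistic modifiers. The sketch below is illustrative only, not the paper's implementation: the modifier pool, function names, and especially the `score` stub are hypothetical. In the real attack, scoring would mean generating an image from the candidate prompt with the black-box T2I model and computing CLIP similarity to the target image; here a toy keyword-overlap score stands in so the sketch runs standalone.

```python
import random

# Hypothetical modifier pool; a real attack would mine such modifiers
# from public prompt galleries. Illustrative only.
MODIFIER_POOL = [
    "highly detailed", "octane render", "trending on artstation",
    "soft lighting", "oil painting", "8k", "cinematic",
]

def score(prompt: str) -> float:
    """Stand-in for the black-box objective. In PROMPTMINER this would be
    CLIP similarity between the target image and an image generated from
    `prompt`; here we fake it with keyword overlap so the sketch runs."""
    target_modifiers = {"octane render", "soft lighting", "8k"}
    return sum(m in prompt for m in target_modifiers) / len(target_modifiers)

def fuzz_modifiers(subject: str, rounds: int = 200, seed: int = 0) -> str:
    """Greedy fuzzing loop: mutate the modifier list (add/drop/swap) and
    keep only mutations that strictly improve the black-box score."""
    rng = random.Random(seed)
    best_mods: list[str] = []
    best = score(subject)
    for _ in range(rounds):
        mods = list(best_mods)
        op = rng.choice(["add", "drop", "swap"])
        if op == "add" or not mods:
            mods.append(rng.choice(MODIFIER_POOL))
        elif op == "drop":
            mods.pop(rng.randrange(len(mods)))
        else:
            mods[rng.randrange(len(mods))] = rng.choice(MODIFIER_POOL)
        candidate = subject + ", " + ", ".join(mods) if mods else subject
        s = score(candidate)
        if s > best:
            best, best_mods = s, mods
    return subject + (", " + ", ".join(best_mods) if best_mods else "")
```

Because only score-improving mutations survive, the recovered prompt is guaranteed to score at least as well as the bare subject; the RL-based phase that produces the subject itself is a separate (and more involved) component not sketched here.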
Key Contributions
- Two-phase black-box prompt stealing framework (PROMPTMINER) that decouples subject reconstruction via RL from stylistic modifier recovery via fuzzing, requiring no gradient access or large labeled datasets
- Achieves CLIP similarity up to 0.958 and SBERT alignment up to 0.751, surpassing all baselines across multiple diffusion backbones
- Demonstrates robustness against defensive perturbations and a 7.5% CLIP-similarity improvement over the strongest baseline on in-the-wild images from unknown generators
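The CLIP-similarity figures quoted above reduce to a cosine similarity between CLIP image embeddings: one for the target image, one for an image regenerated from the stolen prompt. A minimal sketch of the metric, assuming the embeddings are already computed (e.g., by a CLIP image encoder, which is not included here):

```python
import numpy as np

def clip_similarity(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors. A CLIP-similarity
    score like the paper's 0.958 is this quantity computed on CLIP image
    embeddings of the target and reconstructed images."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return float(a @ b)
```

The SBERT textual-alignment score is analogous, but computed on sentence embeddings of the ground-truth and recovered prompts rather than on image embeddings.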
🛡️ Threat Analysis
The paper studies 'prompt stealing': the unauthorized extraction and reuse of carefully engineered prompts, which are explicitly framed as 'valuable digital assets' and intellectual property associated with T2I generative systems. The threat model parallels model extraction attacks (ML05): a black-box adversary queries the system (here, observes its output images) to steal its valuable IP, prompts rather than weights. The paper also highlights 'model provenance analysis' as a key use case, further aligning with ML05's IP-protection concerns.