PROMPTMINER: Black-Box Prompt Stealing against Text-to-Image Generative Models via Reinforcement Learning and Fuzz Optimization
Mingzhe Li, Renhao Zhang, Zhiyang Wen, Siqi Pan, Bruno Castro da Silva, Juan Zhai, Shiqing Ma
Published on arXiv
arXiv:2511.22119
Model Theft
OWASP ML Top 10 — ML05
Key Finding
PROMPTMINER achieves CLIP similarity up to 0.958 and outperforms the strongest baseline by 7.5% on in-the-wild images, with no white-box access to the generative model.
PROMPTMINER
Novel technique introduced
Text-to-image (T2I) generative models such as Stable Diffusion and FLUX can synthesize realistic, high-quality images directly from textual prompts. The resulting image quality depends critically on well-crafted prompts that specify both subjects and stylistic modifiers, which have become valuable digital assets. However, the rising value and ubiquity of high-quality prompts expose them to security and intellectual-property risks. One key threat is the prompt stealing attack, i.e., the task of recovering the textual prompt that generated a given image. Prompt stealing enables unauthorized extraction and reuse of carefully engineered prompts, yet it can also support beneficial applications such as data attribution, model provenance analysis, and watermarking validation.

Existing approaches often assume white-box gradient access, require large-scale labeled datasets for supervised training, or rely solely on captioning without explicit optimization, limiting their practicality and adaptability. To address these challenges, we propose PROMPTMINER, a black-box prompt stealing framework that decouples the task into two phases: (1) a reinforcement learning-based optimization phase to reconstruct the primary subject, and (2) a fuzzing-driven search phase to recover stylistic modifiers.

Experiments across multiple datasets and diffusion backbones demonstrate that PROMPTMINER achieves superior results, with CLIP similarity up to 0.958 and SBERT textual alignment up to 0.751, surpassing all baselines. Even when applied to in-the-wild images with unknown generators, it outperforms the strongest baseline by 7.5% in CLIP similarity, demonstrating better generalization. Finally, PROMPTMINER maintains strong performance under defensive perturbations, highlighting remarkable robustness. Code: https://github.com/aaFrostnova/PromptMiner
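The fuzzing-driven search phase described in the abstract can be pictured as a greedy mutate-and-keep loop over a pool of candidate stylistic modifiers. The sketch below is illustrative only, not the paper's implementation: the modifier pool, function names, and especially the `score` stub are hypothetical. In the real attack, scoring would mean generating an image from the candidate prompt with the black-box T2I model and computing CLIP similarity to the target image; here a toy keyword-overlap score stands in so the sketch runs standalone.

```python
import random

# Hypothetical modifier pool; a real attack would mine such modifiers
# from public prompt galleries. Illustrative only.
MODIFIER_POOL = [
    "highly detailed", "octane render", "trending on artstation",
    "soft lighting", "oil painting", "8k", "cinematic",
]

def score(prompt: str) -> float:
    """Stand-in for the black-box objective. In PROMPTMINER this would be
    CLIP similarity between the target image and an image generated from
    `prompt`; here we fake it with keyword overlap so the sketch runs."""
    target_modifiers = {"octane render", "soft lighting", "8k"}
    return sum(m in prompt for m in target_modifiers) / len(target_modifiers)

def fuzz_modifiers(subject: str, rounds: int = 200, seed: int = 0) -> str:
    """Greedy fuzzing loop: mutate the modifier list (add/drop/swap) and
    keep only mutations that strictly improve the black-box score."""
    rng = random.Random(seed)
    best_mods: list[str] = []
    best = score(subject)
    for _ in range(rounds):
        mods = list(best_mods)
        op = rng.choice(["add", "drop", "swap"])
        if op == "add" or not mods:
            mods.append(rng.choice(MODIFIER_POOL))
        elif op == "drop":
            mods.pop(rng.randrange(len(mods)))
        else:
            mods[rng.randrange(len(mods))] = rng.choice(MODIFIER_POOL)
        candidate = subject + ", " + ", ".join(mods) if mods else subject
        s = score(candidate)
        if s > best:
            best, best_mods = s, mods
    return subject + (", " + ", ".join(best_mods) if best_mods else "")
```

Because only score-improving mutations survive, the recovered prompt is guaranteed to score at least as well as the bare subject; the RL-based phase that produces the subject itself is a separate (and more involved) component not sketched here.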
Key Contributions
- Two-phase black-box prompt stealing framework (PROMPTMINER) that decouples subject reconstruction via RL from stylistic modifier recovery via fuzzing, requiring no gradient access or large labeled datasets
- Achieves CLIP similarity up to 0.958 and SBERT alignment up to 0.751, surpassing all baselines across multiple diffusion backbones
- Demonstrates robustness against defensive perturbations and a 7.5% CLIP-similarity improvement over the strongest baseline on in-the-wild images from unknown generators
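The CLIP-similarity figures quoted above reduce to a cosine similarity between CLIP image embeddings: one for the target image, one for an image regenerated from the stolen prompt. A minimal sketch of the metric, assuming the embeddings are already computed (e.g., by a CLIP image encoder, which is not included here):

```python
import numpy as np

def clip_similarity(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors. A CLIP-similarity
    score like the paper's 0.958 is this quantity computed on CLIP image
    embeddings of the target and reconstructed images."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return float(a @ b)
```

The SBERT textual-alignment score is analogous, but computed on sentence embeddings of the ground-truth and recovered prompts rather than on image embeddings.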
🛡️ Threat Analysis
The paper studies 'prompt stealing': the unauthorized extraction and reuse of carefully engineered prompts, which are explicitly framed as 'valuable digital assets' and intellectual property associated with T2I generative systems. The threat model parallels model extraction attacks (ML05): a black-box adversary queries the system (here, observes its output images) to steal its valuable IP, prompts rather than weights. The paper also highlights 'model provenance analysis' as a key use case, further aligning with ML05's IP-protection concerns.