Reinforcement Learning-Based Prompt Template Stealing for Text-to-Image Models
Published on arXiv
arXiv:2510.00046
Model Theft
OWASP ML Top 10 — ML05
Key Finding
RLStealer achieves state-of-the-art prompt template stealing performance while reducing attack cost by at least 87% compared to the evolutionary-algorithm baseline EvoStealer.
RLStealer
Novel technique introduced
Multimodal Large Language Models (MLLMs) have transformed text-to-image workflows, allowing designers to create novel visual concepts with unprecedented speed. This progress has given rise to a thriving prompt trading market, where curated prompts that induce trademark styles are bought and sold. Although commercially attractive, prompt trading also introduces a largely unexamined security risk: the prompts themselves can be stolen. In this paper, we expose this vulnerability and present RLStealer, a reinforcement-learning-based prompt inversion framework that recovers a prompt template from only a small set of images generated with it. RLStealer treats template stealing as a sequential decision-making problem and employs multiple similarity-based feedback signals as reward functions to effectively explore the prompt space. Comprehensive experiments on publicly available benchmarks demonstrate that RLStealer achieves state-of-the-art performance while reducing the total attack cost to under 13% of that required by existing baselines. Our further analysis confirms that RLStealer can effectively generalize across different image styles to efficiently steal unseen prompt templates. Our study highlights an urgent security threat inherent in prompt trading and lays the groundwork for developing protective standards in the emerging MLLM marketplace.
Key Contributions
- RLStealer: a PPO-based prompt inversion framework that decomposes prompt templates into Subject/Modifiers/Supplement components and uses image-similarity reward signals to efficiently search the discrete prompt space
- Reduces total attack cost to under 13% of the evolutionary-algorithm baseline (EvoStealer) while achieving state-of-the-art prompt template recovery on the PRISM dataset
- Demonstrates cross-style generalization to unseen prompt templates, exposing a critical and underexamined IP vulnerability in the emerging prompt trading ecosystem
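The paper summary above does not spell out the exact reward formulation. As a hedged illustration only, the image-similarity feedback signal could be sketched as the mean pairwise cosine similarity between embeddings of images generated from a candidate template and the victim's example images; the `template_reward` helper and its embedding inputs are assumptions for illustration, not the paper's API:

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity of two embedding vectors; 0.0 for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def template_reward(generated_embs, example_embs):
    """Scalar RL reward for a candidate prompt template: mean pairwise
    similarity between embeddings of images the candidate generates
    and embeddings of the victim's example images (illustrative)."""
    sims = [cosine_similarity(g, e)
            for g in generated_embs for e in example_embs]
    return sum(sims) / len(sims)
```

In a real attack the embeddings would come from a perceptual or vision-language encoder, and the paper combines multiple such similarity signals rather than a single score.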
🛡️ Threat Analysis
RLStealer's primary goal is IP theft: it clones proprietary prompt templates, which are commercially sold products whose value lies in reproducing a specific artistic style, by inferring them from model-generated output images. This directly parallels model extraction attacks (e.g., knockoff nets), in which an adversary uses model outputs to recover and clone valuable learned functionality; here the stolen IP is the engineered prompt rather than the model weights. The attacker never accesses the diffusion model's internals; black-box access to its outputs is enough to reconstruct the creator's IP.
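To make the black-box threat model concrete, here is a deliberately toy sketch: the attacker observes only generator outputs and searches for a template that reproduces them. The generator is a bag-of-words stand-in and the search is a greedy random-mutation loop, a crude simplification of the paper's PPO policy; every name and parameter here is an illustrative assumption:

```python
import random
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def toy_generator(modifiers, vocab):
    """Stand-in for black-box image generation: maps a modifier set to a
    bag-of-words 'image embedding'. Purely illustrative."""
    return [1.0 if w in modifiers else 0.0 for w in vocab]

def steal_template(target_emb, vocab, steps=200, seed=0):
    """Greedy random-mutation search over modifier sets, guided only by
    output similarity (a simplification of RLStealer's RL search)."""
    rng = random.Random(seed)
    current, best = set(), 0.0
    for _ in range(steps):
        candidate = set(current)
        # Flip one modifier in or out of the candidate template.
        candidate.symmetric_difference_update({rng.choice(vocab)})
        sim = cosine_similarity(toy_generator(candidate, vocab), target_emb)
        if sim > best:  # keep strict improvements only
            current, best = candidate, sim
    return current, best

# The victim's hidden template induces a "style"; the attacker sees only
# its output embedding, never the template itself.
vocab = ["watercolor", "pastel", "neon", "sketch", "noir"]
hidden = {"watercolor", "pastel"}
recovered, score = steal_template(toy_generator(hidden, vocab), vocab)
```

Even this naive loop recovers the hidden modifier set from outputs alone, which is the core of the vulnerability: output access substitutes for access to the template.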