No Caption, No Problem: Caption-Free Membership Inference via Model-Fitted Embeddings
Joonsung Jeon, Woo Jae Kim, Suhyeon Ha, Sooel Son, Sung-Eui Yoon
Published on arXiv: 2602.22689
Membership Inference Attack
OWASP ML Top 10 — ML04
Key Finding
MoFit consistently outperforms prior VLM-conditioned MIA baselines and achieves performance competitive with caption-dependent methods across multiple datasets without access to ground-truth captions.
MoFit
Novel technique introduced
Latent diffusion models have achieved remarkable success in high-fidelity text-to-image generation, but their tendency to memorize training data raises critical privacy and intellectual property concerns. Membership inference attacks (MIAs) provide a principled way to audit such memorization by determining whether a given sample was included in training. However, existing approaches assume access to ground-truth captions. This assumption fails in realistic scenarios where only images are available and their textual annotations remain undisclosed, and prior methods become ineffective when ground-truth captions are substituted with vision-language model (VLM) captions. In this work, we propose MoFit, a caption-free MIA framework that constructs synthetic conditioning inputs that are explicitly overfitted to the target model's generative manifold. Given a query image, MoFit proceeds in two stages: (i) model-fitted surrogate optimization, where a perturbation applied to the image is optimized so that the resulting surrogate lies in regions of the model's unconditional prior learned from member samples, and (ii) surrogate-driven embedding extraction, where a model-fitted embedding is derived from the surrogate and then used as a mismatched condition for the query image. This embedding amplifies conditional loss responses for member samples while leaving hold-outs relatively less affected, thereby enhancing separability in the absence of ground-truth captions. Our comprehensive experiments across multiple datasets and diffusion models demonstrate that MoFit consistently outperforms prior VLM-conditioned baselines and achieves performance competitive with caption-dependent methods.
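The two-stage pipeline can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the diffusion model's unconditional denoising loss is stood in for by a fixed linear map (`W_uncond`), and the embedding extractor by another fixed linear map (`W_embed`). The structure (optimize a perturbation against the unconditional prior, then score the query under the mismatched surrogate-derived condition) mirrors the description above, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8

# Hypothetical stand-in for the target model's unconditional score:
# low loss ~ the input lies in a region the learned prior fits well.
W_uncond = rng.normal(size=(D, D)) / np.sqrt(D)

def uncond_loss(x):
    # Proxy for the model's unconditional denoising loss at x.
    r = W_uncond @ x
    return 0.5 * float(r @ r)

def uncond_loss_grad(x):
    return W_uncond.T @ (W_uncond @ x)

# Stage (i): model-fitted surrogate optimization. Perturb the query image
# so the surrogate x + delta sits in low-loss regions of the prior.
def fit_surrogate(x, steps=200, lr=0.1):
    delta = np.zeros_like(x)
    for _ in range(steps):
        delta -= lr * uncond_loss_grad(x + delta)
    return x + delta

# Stage (ii): surrogate-driven embedding extraction. Derive a conditioning
# embedding from the surrogate (here a fixed linear encoder, an assumption)
# and score the *query* under that mismatched condition.
W_embed = rng.normal(size=(D, D)) / np.sqrt(D)

def conditional_loss(x, cond):
    # Proxy conditional denoising loss under conditioning vector `cond`.
    r = W_uncond @ x - cond
    return 0.5 * float(r @ r)

def mofit_score(x):
    surrogate = fit_surrogate(x)
    embedding = W_embed @ surrogate
    # A stronger conditional-loss response is read as evidence of membership.
    return conditional_loss(x, embedding)

x_query = rng.normal(size=D)
score = mofit_score(x_query)
print(f"MoFit-style score: {score:.4f}")
```

The key design point carried over from the description is that the attacker never needs a caption: the conditioning input is manufactured from the query image itself via the target model's own prior.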
Key Contributions
- Caption-free MIA framework (MoFit) that eliminates the unrealistic assumption of ground-truth text caption access for membership inference against text-to-image diffusion models
- Model-fitted surrogate optimization: a two-stage pipeline that perturbs query images into regions of the model's unconditional prior to construct conditioning embeddings overfitted to the target model's generative manifold
- Demonstrated that MoFit outperforms VLM-conditioned baselines and achieves performance competitive with caption-dependent MIA methods across multiple datasets and diffusion models
🛡️ Threat Analysis
The core contribution is a membership inference attack that determines whether a given image was included in a diffusion model's training set: the canonical binary "was this in training?" threat.
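The binary decision is typically made by thresholding a per-sample score such as the conditional-loss response. A generic sketch of that final step, using synthetic score distributions (not numbers from the paper) and a threshold calibrated on known non-members:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic, illustrative score distributions: members are assumed to
# receive higher attack scores than hold-out (non-member) samples.
member_scores = rng.normal(loc=1.0, scale=0.5, size=1000)
holdout_scores = rng.normal(loc=0.0, scale=0.5, size=1000)

def decide_membership(scores, calibration_scores, fpr=0.05):
    # Calibrate the threshold on known non-member scores so that at most
    # `fpr` of them would be flagged, then apply it to the query scores.
    threshold = np.quantile(calibration_scores, 1.0 - fpr)
    return scores > threshold, threshold

flags, thr = decide_membership(member_scores, holdout_scores)
tpr = flags.mean()  # true-positive rate at ~5% false-positive rate
print(f"threshold={thr:.3f}, TPR@5%FPR={tpr:.3f}")
```

Reporting true-positive rate at a low fixed false-positive rate, rather than raw accuracy, is the standard way MIA separability is evaluated.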