Localizing and Mitigating Memorization in Image Autoregressive Models
Aditya Kasliwal, Franziska Boenisch, Adam Dziedzic
Published on arXiv
2509.00488
Model Inversion Attack
OWASP ML Top 10 — ML03
Key Finding
Intervening on the most memorizing components of IAR models significantly reduces training data extraction capacity with minimal impact on generated image quality.
Image AutoRegressive (IAR) models have achieved state-of-the-art performance in the speed and quality of generated images. However, they also raise concerns about memorization of their training data and its implications for privacy. This work explores where and how such memorization occurs within different image autoregressive architectures by measuring fine-grained memorization. The analysis reveals that memorization patterns differ across IAR architectures: in hierarchical per-resolution architectures, memorization tends to emerge early and deepen with increasing resolution, while in IARs with standard per-token autoregressive prediction, it concentrates in later processing stages. These localized memorization patterns are further connected to IARs' ability to memorize and leak training data. By intervening on the most memorizing components, we significantly reduce the capacity for data extraction from IARs with minimal impact on the quality of generated images. These findings offer new insights into the internal behavior of image generative models and point toward practical strategies for mitigating privacy risks.
Key Contributions
- Fine-grained localization of memorization patterns across different IAR architectures (hierarchical per-resolution vs. per-token autoregressive), revealing architecture-specific memorization dynamics
- Connection between localized memorization and the model's capacity to leak training data through extraction attacks
- Component-level intervention strategy that significantly reduces data extraction capacity while preserving generated image quality
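The component-level intervention above can be illustrated with a minimal sketch: rank model components (layers, attention heads, etc.) by a per-component memorization score and ablate the top-scoring ones. The component names, scores, and zero-ablation strategy below are illustrative assumptions, not the paper's exact procedure.

```python
# Hypothetical sketch: select the most memorizing components by score,
# then disable them via a simple zero-ablation.

def select_most_memorizing(scores: dict[str, float], k: int) -> list[str]:
    """Return the k component names with the highest memorization scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def apply_intervention(weights: dict[str, list[float]],
                       targets: list[str]) -> dict[str, list[float]]:
    """Zero out the weights of the targeted components (a crude ablation)."""
    return {
        name: ([0.0] * len(w) if name in targets else w)
        for name, w in weights.items()
    }

# Toy example: per-layer memorization scores for a hypothetical IAR model.
scores = {"layer_0": 0.12, "layer_3": 0.30, "layer_7": 0.91, "layer_11": 0.85}
weights = {name: [1.0, 1.0] for name in scores}

targets = select_most_memorizing(scores, k=2)   # ["layer_7", "layer_11"]
pruned = apply_intervention(weights, targets)
```

In practice the intervention would be applied inside the model (e.g., masking attention heads or fine-tuning the selected layers) rather than zeroing raw weight tensors, but the selection-then-ablation structure is the same.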
🛡️ Threat Analysis
The paper explicitly frames memorization as enabling adversarial training data extraction from IAR models. It measures fine-grained memorization, connects it to data leakage capacity, and proposes interventions to reduce an adversary's ability to extract training data — a direct fit for the model inversion / training data reconstruction threat.
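One common way to quantify the extraction risk described above is to count generated samples that near-duplicate some training example. The sketch below is a hypothetical illustration using Euclidean distance and an arbitrary threshold; the paper's actual extraction and memorization measures may differ.

```python
# Hypothetical sketch: flag a generation as "extracted" training data if its
# nearest training example lies within a distance threshold.

def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two flattened image vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def count_extracted(generated: list[list[float]],
                    training: list[list[float]],
                    threshold: float) -> int:
    """Count generations that near-duplicate some training sample."""
    extracted = 0
    for g in generated:
        if min(l2_distance(g, t) for t in training) < threshold:
            extracted += 1
    return extracted

# Toy data: one generation is a near-copy of a training sample, one is not.
training = [[0.0, 0.0], [1.0, 1.0]]
generated = [[0.01, 0.0], [5.0, 5.0]]
print(count_extracted(generated, training, threshold=0.1))  # prints 1
```

A mitigation that targets the most memorizing components should drive this count down while leaving standard image-quality metrics (e.g., FID) largely unchanged, which is the trade-off the paper's key finding reports.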