defense arXiv Aug 30, 2025
Aditya Kasliwal, Franziska Boenisch, Adam Dziedzic · CISPA Helmholtz Center for Information Security
Localizes memorization hotspots in image autoregressive models and intervenes to reduce training data extraction with minimal quality loss
Model Inversion Attack vision generative
Image AutoRegressive (IAR) models have achieved state-of-the-art performance in the speed and quality of generated images. However, they also raise concerns about memorization of their training data and its implications for privacy. This work explores where and how such memorization occurs within different image autoregressive architectures by measuring fine-grained memorization. The analysis reveals that memorization patterns differ across IAR architectures: in hierarchical per-resolution architectures, memorization tends to emerge early and deepen with resolution, while in IARs with standard per-token autoregressive prediction, it concentrates in later processing stages. These localized memorization patterns are further connected to IARs' ability to memorize and leak training data. By intervening on the most memorizing components, we significantly reduce the capacity for data extraction from IARs with minimal impact on the quality of generated images. These findings offer new insights into the internal behavior of image generative models and point toward practical strategies for mitigating privacy risks.
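The localize-then-intervene idea can be sketched in a few lines. This is a hypothetical simplification, not the paper's actual procedure: here a component's memorization score is proxied by the gap between its reconstruction loss on non-member and member images, and the intervention simply zeroes the activations of the top-k scoring components.

```python
import numpy as np

def memorization_scores(member_losses, nonmember_losses):
    """Per-component memorization proxy: the gap between non-member and
    member loss. A larger gap suggests the component memorizes members.
    Both arrays have shape (n_components,)."""
    return nonmember_losses - member_losses

def intervene(activations, scores, k):
    """Zero out the activations of the k most memorizing components."""
    top = np.argsort(scores)[-k:]   # indices of the k largest gaps
    out = activations.copy()
    out[..., top] = 0.0
    return out, top

# Toy example: 6 components; component 4 has by far the largest gap.
member = np.array([0.20, 0.30, 0.25, 0.40, 0.05, 0.30])
nonmem = np.array([0.25, 0.32, 0.30, 0.45, 0.90, 0.33])
scores = memorization_scores(member, nonmem)
acts = np.ones((2, 6))              # a batch of component activations
ablated, top = intervene(acts, scores, k=1)
```

The quality/privacy trade-off in the abstract corresponds to the choice of k: ablating only the few most memorizing components suppresses extraction while leaving most of the network untouched.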
transformer
defense arXiv Aug 6, 2025
Iyiola E. Olatunji, Franziska Boenisch, Jing Xu et al. · University of Luxembourg · CISPA Helmholtz Center for Information Security
Attacks graph-aware LLMs via poisoning, evasion, and template injection; proposes GALGUARD combining feature correction and GNN defenses
Input Manipulation Attack Data Poisoning Attack Prompt Injection graph nlp
Large Language Models (LLMs) are increasingly integrated with graph-structured data for tasks like node classification, a domain traditionally dominated by Graph Neural Networks (GNNs). While this integration leverages rich relational information to improve task performance, the robustness of graph-aware LLMs against adversarial attacks remains largely unexplored. We take the first step toward exploring these vulnerabilities by applying existing adversarial attack methods tailored for graph-based models, including both poisoning (training-time) and evasion (test-time) attacks, to two representative models, LLAGA (Chen et al. 2024) and GRAPHPROMPTER (Liu et al. 2024). Additionally, we discover a new attack surface for LLAGA where an attacker can inject malicious nodes as placeholders into the node sequence template to severely degrade its performance. Our systematic analysis reveals that certain design choices in graph encoding can enhance attack success, with specific findings that: (1) the node sequence template in LLAGA increases its vulnerability; (2) the GNN encoder used in GRAPHPROMPTER demonstrates greater robustness; and (3) both approaches remain susceptible to imperceptible feature perturbation attacks. Finally, we propose an end-to-end defense framework, GALGUARD, which combines an LLM-based feature correction module to mitigate feature-level perturbations with adapted GNN defenses that protect against structural attacks.
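The template-injection attack surface can be illustrated with a toy sketch. This is a hypothetical simplification of LLaGA-style node-sequence encoding, not its real implementation: the model serializes a center node and its neighbors into a fixed-length slot template, so an attacker who can add edges fills those slots with malicious placeholder nodes, crowding out the informative neighbors.

```python
def build_node_sequence(center, neighbors, max_len=4, pad="[PAD]"):
    """Fixed-length node sequence (hypothetical simplification of a
    LLaGA-style template): center node, then neighbors, then padding."""
    seq = [center] + neighbors[: max_len - 1]
    seq += [pad] * (max_len - len(seq))
    return seq

def inject_placeholders(neighbors, malicious, budget):
    """Attacker prepends malicious node ids so they occupy the template
    slots that would otherwise hold informative neighbors."""
    return malicious[:budget] + neighbors

# The victim's true neighbors are displaced by the injected nodes.
poisoned = inject_placeholders(["v1", "v2", "v3"], ["m1", "m2"], budget=2)
seq = build_node_sequence("v0", poisoned)
```

Because the template truncates at `max_len`, even a small injection budget can evict most of the genuine neighborhood from the LLM's view of the graph.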
llm gnn transformer
attack arXiv Feb 28, 2026
Dariush Wahdany, Matthew Jagielski, Adam Dziedzic et al. · CISPA Helmholtz Center for Information Security · Anthropic
Membership inference attacks expose private data leakage in curation pipelines even when models train only on public data
Membership Inference Attack vision
In machine learning, curation is used to select the most valuable data for improving both model accuracy and computational efficiency. Recently, curation has also been explored as a solution for private machine learning: rather than training directly on sensitive data, which is known to leak information through model predictions, the private data is used only to guide the selection of useful public data. The resulting model is then trained solely on curated public data. It is tempting to assume that such a model is privacy-preserving because it has never seen the private data. Yet, we show that without further protection, curation pipelines can still leak private information. Specifically, we introduce novel attacks against popular curation methods, targeting every major step: the computation of curation scores, the selection of the curated subset, and the final trained model. We demonstrate that each stage reveals information about the private dataset and that even models trained exclusively on curated public data leak membership information about the private data that guided curation. These findings highlight the previously overlooked inherent privacy risks of data curation and show that privacy assessment must extend beyond the training procedure to include the data selection process. Our differentially private adaptations of curation methods effectively mitigate leakage, indicating that formal privacy guarantees for curation are a promising direction.
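The simplest of the leaks described above, membership inference from curation scores, can be sketched as a threshold attack. This is an illustrative toy, not one of the paper's attacks: assuming private points that guided curation tend to receive higher scores than non-members, an adversary who observes a candidate's score simply thresholds it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical curation scores leaked by the pipeline: points that
# guided curation ("members") tend to score higher than non-members.
member_scores = rng.normal(1.0, 0.5, 1000)
nonmember_scores = rng.normal(0.0, 0.5, 1000)

def threshold_attack(score, tau=0.5):
    """Guess 'member' when the observed curation score exceeds tau."""
    return score > tau

tpr = np.mean([threshold_attack(s) for s in member_scores])
fpr = np.mean([threshold_attack(s) for s in nonmember_scores])
```

A gap between true- and false-positive rates is exactly the kind of leakage the abstract warns about; the differentially private adaptations mentioned there would add calibrated noise to the scores to shrink this gap.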
cnn transformer
defense arXiv Mar 3, 2026
Maciej Chrabąszcz, Aleksander Szymczyk, Jan Dubiński et al. · NASK National Research Institute · Warsaw University of Technology +3 more
Proposes conditioned activation transport to steer T2I model activations away from unsafe regions while preserving image quality
Prompt Injection vision multimodal generative
Despite their impressive capabilities, current Text-to-Image (T2I) models remain prone to generating unsafe and toxic content. While activation steering offers a promising inference-time intervention, we observe that linear activation steering frequently degrades image quality when applied to benign prompts. To address this trade-off, we first construct SafeSteerDataset, a contrastive dataset containing 2300 safe and unsafe prompt pairs with high cosine similarity. Leveraging this data, we propose Conditioned Activation Transport (CAT), a framework that employs a geometry-based conditioning mechanism and nonlinear transport maps. By conditioning transport maps to activate only within unsafe activation regions, we minimize interference with benign queries. We validate our approach on two state-of-the-art architectures: Z-Image and Infinity. Experiments demonstrate that CAT generalizes effectively across these backbones, significantly reducing Attack Success Rate while maintaining image fidelity compared to unsteered generations. Warning: This paper contains potentially offensive text and images.
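The conditioning idea behind CAT can be sketched with a minimal stand-in. This is a hypothetical simplification: a linear projection replaces the paper's learned nonlinear transport maps, and the geometric condition is a cosine-similarity test against a single unsafe direction, so the intervention fires only inside the unsafe activation region and leaves benign activations untouched.

```python
import numpy as np

def conditioned_steer(h, unsafe_dir, tau=0.5, alpha=1.0):
    """Apply a steering shift only when activation h points into the
    unsafe region (cosine similarity with unsafe_dir above tau).
    Benign activations pass through unchanged."""
    u = unsafe_dir / np.linalg.norm(unsafe_dir)
    cos = float(h @ u) / (np.linalg.norm(h) + 1e-8)
    if cos <= tau:
        return h                        # benign: no intervention
    return h - alpha * (h @ u) * u      # remove the unsafe component

unsafe_dir = np.array([1.0, 0.0])
h_bad = np.array([2.0, 0.5])            # mostly aligned with unsafe_dir
h_benign = np.array([0.1, 1.0])         # nearly orthogonal to it
steered = conditioned_steer(h_bad, unsafe_dir)
untouched = conditioned_steer(h_benign, unsafe_dir)
```

The gate is what resolves the trade-off in the abstract: unconditional linear steering would also shift `h_benign` and degrade benign generations, while the conditioned version acts only past the threshold `tau`.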
diffusion transformer NASK National Research Institute · Warsaw University of Technology · Tooplox +2 more
benchmark arXiv Aug 16, 2025
Jimmy Z. Di, Yiwei Lu, Yaoliang Yu et al. · University of Waterloo · Vector Institute +2 more
Proposes FB-Mem segmentation metric to quantify partial training data memorization in diffusion models, showing current mitigations fail for foreground regions
Model Inversion Attack vision generative
Diffusion models (DMs) memorize training images and can reproduce near-duplicates during generation. Current detection methods identify verbatim memorization but fail to capture two critical aspects: quantifying partial memorization occurring in small image regions, and memorization patterns beyond specific prompt-image pairs. To address these limitations, we propose Foreground Background Memorization (FB-Mem), a novel segmentation-based metric that classifies and quantifies memorized regions within generated images. Our method reveals that memorization is more pervasive than previously understood: (1) individual generations from single prompts may be linked to clusters of similar training images, revealing complex memorization patterns that extend beyond one-to-one correspondences; and (2) existing model-level mitigation methods, such as neuron deactivation and pruning, fail to eliminate local memorization, which persists particularly in foreground regions. Our work establishes an effective framework for measuring memorization in diffusion models, demonstrates the inadequacy of current mitigation approaches, and proposes a stronger mitigation method using a clustering approach.
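A region-wise memorization check in the spirit of FB-Mem can be sketched as follows. This is a hypothetical simplification of the metric: given a foreground mask, it scores cosine similarity between a generated image and its nearest training image separately on foreground and background pixels, flagging each region independently instead of issuing one verbatim yes/no verdict.

```python
import numpy as np

def region_memorization(gen, train, fg_mask, tau=0.9):
    """Flag memorization per region (hypothetical simplification of
    FB-Mem): cosine similarity between generated and training pixels,
    computed separately under the foreground and background masks."""
    def cos(a, b):
        a, b = a.ravel(), b.ravel()
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    fg = cos(gen[fg_mask], train[fg_mask])
    bg = cos(gen[~fg_mask], train[~fg_mask])
    return {"foreground": fg > tau, "background": bg > tau,
            "fg_sim": fg, "bg_sim": bg}

# Toy case: the generation copies the training foreground verbatim
# but has an unrelated background -- verbatim detectors may miss it.
rng = np.random.default_rng(1)
train = rng.normal(size=(8, 8))
gen = rng.normal(size=(8, 8))
mask = np.zeros((8, 8), dtype=bool)
mask[:4] = True                       # top half is "foreground"
gen[mask] = train[mask]
result = region_memorization(gen, train, mask)
```

This per-region view is what lets the metric surface the paper's finding that mitigation methods leave foreground memorization intact even when whole-image similarity looks benign.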
diffusion University of Waterloo · Vector Institute · University of Ottawa +1 more