benchmark · arXiv · Aug 12, 2025
Jungwoo Kim, Jong-Seok Lee · Yonsei University
Discovers that adversarial examples from earlier continual-learning stages transfer effectively to later-stage models, exposing a new black-box attack vector in Class-IL
Input Manipulation Attack · vision
Class-incremental continual learning addresses catastrophic forgetting by enabling classification models to preserve knowledge of previously learned classes while acquiring new ones. However, the vulnerability of these models to adversarial attacks during this process has not been sufficiently investigated. In this paper, we present the first exploration of vulnerability to stage-transferred attacks, i.e., attacks in which an adversarial example generated using the model at an earlier stage is used to attack the model at a later stage. Our findings reveal that continual learning methods are highly susceptible to these attacks, raising a serious security issue. We explain this phenomenon through model similarity between stages and gradual robustness degradation. Additionally, we find that existing adversarial training-based defense methods are not sufficiently effective against stage-transferred attacks. Code is available at https://github.com/mcml-official/CSAT.
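The setting lends itself to a short sketch: craft an L∞ PGD adversarial example on the earlier-stage model, then measure whether it also fools the later-stage model. Below is a minimal PyTorch sketch, assuming hypothetical models `model_stage_t` and `model_stage_tk` and a labeled batch `(x, y)`; it is illustrative only, not the CSAT reference implementation.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Standard L-infinity PGD: maximize cross-entropy of `model` on (x, y)."""
    # Random start inside the eps-ball, clipped to valid pixel range.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()       # gradient ascent step
            x_adv = x + (x_adv - x).clamp(-eps, eps)  # project back into eps-ball
            x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()

# Stage-transferred attack: craft the example on the stage-t model, then
# evaluate it against the later stage-(t+k) model (black-box transfer).
# `model_stage_t`, `model_stage_tk`, `x`, `y` are hypothetical placeholders.
x_adv = pgd_attack(model_stage_t.eval(), x, y)
pred = model_stage_tk.eval()(x_adv).argmax(dim=1)
attack_success_rate = (pred != y).float().mean()
```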
cnn
defense · arXiv · Jan 9, 2025
Junha Park, Jaehui Hwang, Ian Ryu et al. · Yonsei University
Defends Stable Diffusion safety checkers against adversarial bypass attacks by exploiting the attacks' lack of robustness to small text/latent perturbations
Input Manipulation Attack · vision · generative
With advances in diffusion models, image generation has shown significant performance improvements. This raises concerns about the potential abuse of image generation, such as the creation of explicit or violent images, commonly referred to as Not Safe For Work (NSFW) content. To address this, the Stable Diffusion model includes several safety checkers to censor initial text prompts and final output images generated by the model. However, recent research has shown that these safety checkers are vulnerable to adversarial attacks that allow NSFW images to be generated. In this paper, we find that these adversarial attacks are not robust to small changes in text prompts or input latents. Based on this, we propose SC-Pro (Spherical or Circular Probing), a training-free framework that easily defends against adversarial attacks that generate NSFW images. Moreover, we develop an approach that utilizes one-step diffusion models for efficient NSFW detection (SC-Pro-o), further reducing computational cost. We demonstrate the superiority of our method in terms of performance and applicability.
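The probing idea can be sketched briefly: since an adversarial prompt or latent only barely evades the checker, sampling a few small perturbations around the input and re-running the checker tends to expose it. Below is a minimal PyTorch sketch, assuming hypothetical `decode` and `safety_checker` callables and an illustrative choice of probing circle; the actual SC-Pro geometry and thresholds follow the paper.

```python
import math
import torch

def circular_probes(latent, radius=0.05, n_probes=8):
    """Points on a circle of radius `radius` around `latent`, lying in a
    random 2-D plane of latent space (illustrative choice of plane)."""
    u = torch.randn_like(latent)
    u = u / u.norm()
    v = torch.randn_like(latent)
    v = v - (v * u).sum() * u  # make v orthogonal to u
    v = v / v.norm()
    angles = torch.linspace(0.0, 2 * math.pi, n_probes + 1)[:-1]
    return [latent + radius * (torch.cos(a) * u + torch.sin(a) * v)
            for a in angles]

def nsfw_with_probing(latent, decode, safety_checker, radius=0.05, n_probes=8):
    """Flag NSFW if the checker fires on the original latent or on any probe:
    adversarial latents that barely evade the checker are typically caught
    by at least one slightly perturbed neighbor."""
    for z in [latent] + circular_probes(latent, radius, n_probes):
        # `decode` and `safety_checker` are hypothetical placeholders for an
        # image decoder and a binary NSFW checker, respectively.
        if safety_checker(decode(z)):
            return True
    return False
```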
diffusion · transformer