defense · arXiv · Aug 12, 2025
Yutong Wu, Jie Zhang, Yiming Li et al. · Nanyang Technological University · Technology and Research +2 more
Proposes Cowpox, a distributed cure-sample defense immunizing VLM multi-agent systems against propagating jailbreak infections
Prompt Injection · Excessive Agency · multimodal · nlp
Vision Language Model (VLM)-based agents are stateful, autonomous entities that perceive and interact with their environments through vision and language. Multi-agent systems comprise specialized agents that collaborate to solve complex tasks. A core security property is robustness: the system should maintain its integrity under adversarial attacks. However, existing multi-agent system designs lack robustness considerations, as a successful exploit against one agent can spread to and infect other agents, undermining the entire system's assurance. To address this, we propose a new defense, Cowpox, that provably enhances the robustness of multi-agent systems. It incorporates a distributed mechanism that improves the recovery rate of agents by limiting the expected number of infections passed on to other agents. The core idea is to generate and distribute a special cure sample that immunizes an agent against the attack before exposure and helps already infected agents recover. We demonstrate the effectiveness of Cowpox empirically and provide theoretical robustness guarantees.
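The epidemic framing in the abstract (infections spread agent-to-agent; distributed cure samples immunize before exposure and cure after) can be illustrated with a toy SIR-style simulation. Everything below — the contact model, the probabilities, and the `simulate` function itself — is an illustrative assumption, not the paper's actual mechanism or parameters.

```python
import random

def simulate(n_agents=100, p_transmit=0.3, p_cure_share=0.6,
             p_self_recover=0.1, rounds=20, seed=0):
    """Toy SIR-style simulation of a propagating jailbreak infection.

    States: 'S' susceptible, 'I' infected, 'R' immunized/recovered.
    Immunized agents forward the cure sample to random contacts,
    immunizing susceptible peers before exposure and curing infected
    ones. All parameters here are illustrative, not from the paper.
    """
    rng = random.Random(seed)
    state = ['S'] * n_agents
    state[0] = 'I'  # one initially compromised agent
    for _ in range(rounds):
        nxt = state[:]
        for s in state:
            peers = rng.sample(range(n_agents), 3)  # random contacts this round
            if s == 'I':  # infected agents spread the exploit
                for j in peers:
                    if nxt[j] == 'S' and rng.random() < p_transmit:
                        nxt[j] = 'I'
            elif s == 'R':  # immunized agents distribute the cure sample
                for j in peers:
                    if rng.random() < p_cure_share:
                        nxt[j] = 'R'
        # some infected agents independently obtain the cure and recover
        for i, s in enumerate(nxt):
            if s == 'I' and rng.random() < p_self_recover:
                nxt[i] = 'R'
        state = nxt
    return state.count('I'), state.count('R')
```

Setting `p_cure_share=0.0` disables cure distribution; comparing final infection counts across the two settings gives an intuition for how distributing the cure sample bounds the expected number of downstream infections.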
vlm · multimodal · llm · Nanyang Technological University · Technology and Research · Tsinghua University +1 more
attack · arXiv · Aug 28, 2025
Fahad Shamshad, Tameem Bakr, Yahia Shaaban et al. · Mohamed bin Zayed University of Artificial Intelligence · Michigan State University
Wins NeurIPS 2024 watermark removal challenge via adaptive VAE evasion and diffusion purification, achieving 95.7% removal rate
Output Integrity Attack · vision · generative
Content watermarking is an important tool for authenticating and protecting the copyright of digital media. However, it is unclear whether existing watermarks are robust against adversarial attacks. We present the winning solution to the NeurIPS 2024 Erasing the Invisible challenge, which stress-tests watermark robustness under varying degrees of adversary knowledge. The challenge consisted of two tracks, black-box and beige-box, depending on whether the adversary knows which watermarking method the provider used. For the beige-box track, we leverage an adaptive VAE-based evasion attack with test-time optimization and color-contrast restoration in CIELAB space to preserve image quality. For the black-box track, we first cluster images based on their artifacts in the spatial or frequency domain. Then, we apply image-to-image diffusion models with controlled noise injection and semantic priors from ChatGPT-generated captions to each cluster, using optimized parameter settings. Empirical evaluations demonstrate that our method achieves near-perfect watermark removal (95.7%) with negligible impact on residual image quality. We hope that our attacks inspire the development of more robust image watermarking methods.
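The black-box clustering step — grouping images by their frequency-domain artifacts before purifying each group — can be sketched with a minimal pipeline. `freq_signature` and the `two_means` routine are simplified stand-ins I am assuming for illustration; the paper's actual features and clustering method are not specified in the abstract.

```python
import numpy as np

def freq_signature(img, keep=8):
    """Compact signature from the low-frequency magnitude spectrum.

    Periodic watermark artifacts show up as peaks in the Fourier
    domain; this is a simplified stand-in for artifact-based
    clustering, not the authors' actual features.
    """
    f = np.fft.fftshift(np.fft.fft2(img))
    mag = np.abs(f)
    c = img.shape[0] // 2
    patch = mag[c - keep:c + keep, c - keep:c + keep]
    return (patch / patch.sum()).ravel()  # normalize for scale invariance

def two_means(X, iters=25):
    """Minimal 2-means with deterministic farthest-point initialization."""
    c0 = X[0]
    c1 = X[((X - c0) ** 2).sum(axis=1).argmax()]  # farthest point from X[0]
    centers = np.stack([c0, c1])
    for _ in range(iters):
        # squared distances of every signature to each center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        for j in range(2):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels
```

Images carrying the same periodic artifact land in the same cluster, which would then receive one tuned purification setting (noise level, caption prior) rather than a one-size-fits-all attack.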
diffusion · Mohamed bin Zayed University of Artificial Intelligence · Michigan State University