ML Security Papers

Latest papers

2 papers

benchmark · arXiv · Oct 11, 2025

SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents

Zonghao Ying, Yangguang Shao, Jianle Gan et al. · Beihang University · Chinese Academy of Sciences +7 more

Benchmark evaluating LVLM web agent security across six attack vectors in realistic web environments, exposing universal vulnerabilities across 9 models

Prompt Injection · Excessive Agency · multimodal · nlp

5 citations

defense · arXiv · Aug 2, 2025

PromptSafe: Gated Prompt Tuning for Safe Text-to-Image Generation

Zonglei Jing, Xiao Yang, Xiaoqian Li et al. · Beihang University · Beijing University of Posts and Telecommunications +3 more

Gated soft prompt tuning defense for T2I diffusion models that adaptively suppresses NSFW generation based on estimated prompt toxicity

Prompt Injection · generative

Text-to-image (T2I) models have demonstrated remarkable generative capabilities but remain vulnerable to producing not-safe-for-work (NSFW) content, such as violent or explicit imagery. While recent moderation efforts have introduced soft prompt-guided tuning by appending defensive tokens to the input, these approaches often rely on large-scale curated image-text datasets and apply static, one-size-fits-all defenses at inference time. However, this results not only in high computational cost and degraded benign image quality, but also in limited adaptability to the diverse and nuanced safety requirements of real-world prompts. To address these challenges, we propose PromptSafe, a gated prompt tuning framework that combines a lightweight, text-only supervised soft embedding with an inference-time gated control network. Instead of training on expensive image-text datasets, we first rewrite unsafe prompts into semantically aligned but safe alternatives using an LLM, constructing an efficient text-only training corpus. Based on this, we optimize a universal soft prompt that repels unsafe and attracts safe embeddings during the diffusion denoising process. To avoid over-suppressing benign prompts, we introduce a gated mechanism that adaptively adjusts the defensive strength based on estimated prompt toxicity, thereby aligning defense intensity with prompt risk and ensuring strong protection for harmful inputs while preserving benign generation quality. Extensive experiments across multiple benchmarks and T2I models show that PromptSafe achieves a SOTA unsafe generation rate (2.36%), while preserving high benign fidelity. Furthermore, PromptSafe demonstrates strong generalization to unseen harmful categories, robust transferability across diffusion model architectures, and resilience under adaptive adversarial attacks, highlighting its practical value for safe and scalable deployment.

diffusion transformer Beihang University · Beijing University of Posts and Telecommunications · Taishan University +2 more
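
The gating idea in the PromptSafe abstract can be pictured with a minimal sketch. The abstract does not specify the gate's form, so everything below is an assumption made for illustration: the sigmoid gate, the `threshold` and `sharpness` parameters, the function names, and the choice to scale appended soft-prompt tokens by the gate value. It is not the authors' implementation, and the toxicity score is taken as a given input in [0, 1] rather than estimated.

```python
import math


def gate(toxicity: float, threshold: float = 0.5, sharpness: float = 10.0) -> float:
    """Map an estimated toxicity score in [0, 1] to a defense strength in (0, 1).

    Hypothetical sigmoid gate: low toxicity gives a strength near 0,
    high toxicity gives a strength near 1.
    """
    return 1.0 / (1.0 + math.exp(-sharpness * (toxicity - threshold)))


def apply_gated_soft_prompt(prompt_embedding, soft_prompt, toxicity):
    """Append soft-prompt tokens, scaled by the gate, to the prompt tokens.

    prompt_embedding and soft_prompt are lists of token vectors
    (lists of floats). Scaling the defensive tokens is an assumed way
    of adjusting "defensive strength"; the paper may do this differently.
    """
    strength = gate(toxicity)
    scaled = [[strength * x for x in token] for token in soft_prompt]
    return prompt_embedding + scaled, strength


if __name__ == "__main__":
    prompt = [[0.1, 0.2], [0.3, 0.4]]   # two toy prompt tokens
    defense = [[2.0, 4.0]]              # one toy soft-prompt token
    for score in (0.1, 0.5, 0.9):
        conditioned, strength = apply_gated_soft_prompt(prompt, defense, score)
        print(f"toxicity={score:.1f} strength={strength:.3f} tokens={len(conditioned)}")
```

Under these assumptions a benign prompt leaves the conditioning almost unchanged, while a prompt scored as toxic receives the defensive tokens at nearly full strength, which matches the abstract's stated goal of aligning defense intensity with prompt risk.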
