defense arXiv Aug 4, 2025 · Aug 2025
Kongxin Wang, Jie Zhang, Peigui Qi et al. · University of Science and Technology of China · A*STAR +1 more
Embeds safety guardrails into pose-guided video diffusion models to suppress deepfakes, NSFW content, and impersonation at inference
Output Integrity Attack visiongenerative
Pose-guided video generation has become a powerful tool in creative industries, exemplified by frameworks like Animate Anyone. However, conditioning generation on specific poses introduces serious risks, such as impersonation, privacy violations, and NSFW content creation. To address these challenges, we propose $\textbf{PoseGuard}$, a safety alignment framework for pose-guided generation. PoseGuard is designed to suppress unsafe generations by degrading output quality when encountering malicious poses, while maintaining high-fidelity outputs for benign inputs. We categorize unsafe poses into three representative types: discriminatory gestures such as kneeling or offensive salutes, sexually suggestive poses that lead to NSFW content, and poses imitating copyrighted celebrity movements. PoseGuard employs a dual-objective training strategy combining generation fidelity with safety alignment, and uses LoRA-based fine-tuning for efficient, parameter-light updates. To ensure adaptability to evolving threats, PoseGuard supports pose-specific LoRA fusion, enabling flexible and modular updates when new unsafe poses are identified. We further demonstrate the generalizability of PoseGuard to facial landmark-guided generation. Extensive experiments validate that PoseGuard effectively blocks unsafe generations, maintains generation quality for benign inputs, and remains robust against slight pose variations.
diffusion University of Science and Technology of China · A*STAR · Nanyang Technological University
defense arXiv Aug 28, 2025 · Aug 2025
Weitao Feng, Lixu Wang, Tianyi Wei et al. · Nanyang Technological University · A*STAR +1 more
Defends LLM safety alignment against RL fine-tuning attacks by suppressing response entropy via TokenBuncher
Transfer Learning Attack Prompt Injection nlpreinforcement-learning
As large language models (LLMs) continue to grow in capability, so do the risks of harmful misuse through fine-tuning. While most prior studies assume that attackers rely on supervised fine-tuning (SFT) for such misuse, we systematically demonstrate that reinforcement learning (RL) enables adversaries to more effectively break safety alignment and facilitate more advanced harmful task assistance, under matched computational budgets. To counter this emerging threat, we propose TokenBuncher, the first effective defense specifically targeting RL-based harmful fine-tuning. TokenBuncher suppresses the foundation on which RL relies: model response entropy. By constraining entropy, RL-based fine-tuning can no longer exploit distinct reward signals to drive the model toward harmful behaviors. We realize this defense through entropy-as-reward RL and a Token Noiser mechanism designed to prevent the escalation of harmful capabilities. Extensive experiments across multiple models and RL algorithms show that TokenBuncher robustly mitigates harmful RL fine-tuning while preserving benign task performance and finetunability. Our results highlight that RL-based harmful fine-tuning poses a greater systemic risk than SFT, and that TokenBuncher provides an effective and general defense.
llm rl Nanyang Technological University · A*STAR · Northwestern University