ML Security Papers

Latest papers

18 papers

defense arXiv Mar 19, 2026 · 18d ago

Functional Subspace Watermarking for Large Language Models

Zikang Ding, Junhao Li, Suling Wu et al. · University of Electronic Science and Technology of China · Mohamed bin Zayed University of Artificial Intelligence +1 more

Embeds ownership watermarks in a low-dimensional functional subspace of LLM weights, surviving fine-tuning, quantization, and distillation attacks

Model Theft Model Theft nlp

PDF

defense arXiv Mar 12, 2026 · 25d ago

BackdoorIDS: Zero-shot Backdoor Detection for Pretrained Vision Encoder

Siquan Huang, Yijiang Li, Ningzhi Gao et al. · South China University of Technology · University of California San Diego +1 more

Zero-shot inference-time backdoor detector for vision encoders using progressive masking and embedding trajectory clustering

Model Poisoning vision multimodal

PDF

defense arXiv Feb 24, 2026 · 5w ago

RecoverMark: Robust Watermarking for Localization and Recovery of Manipulated Faces

Haonan An, Xiaohui Ye, Guang Hua et al. · South China University of Technology · Singapore Institute of Technology +1 more

Embeds face content as background watermark to robustly detect, localize, and recover manipulated face regions against removal attacks

Output Integrity Attack vision generative

PDF

The proliferation of AI-generated content has facilitated sophisticated face manipulation, severely undermining visual integrity and posing unprecedented challenges to intellectual property. In response, a common proactive defense leverages fragile watermarks to detect, localize, or even recover manipulated regions. However, these methods always assume an adversary unaware of the embedded watermark, overlooking their inherent vulnerability to watermark removal attacks. Furthermore, this fragility is exacerbated in the commonly used dual-watermark strategy that adds a robust watermark for image ownership verification, where mutual interference and limited embedding capacity reduce the fragile watermark's effectiveness. To address this gap, we propose RecoverMark, a watermarking framework that achieves robust manipulation localization, content recovery, and ownership verification simultaneously. Our key insight is twofold. First, we exploit a critical real-world constraint: an adversary must preserve the background's semantic consistency to avoid visual detection, even if they apply global, imperceptible watermark removal attacks. Second, using the image's own content (the face, in this paper) as the watermark enhances extraction robustness. Based on these insights, RecoverMark treats the protected face content itself as the watermark and embeds it into the surrounding background. By designing a robust two-stage training paradigm with carefully crafted distortion layers that simulate a comprehensive range of potential attacks, together with a progressive training strategy, RecoverMark achieves robust, non-fragile watermark embedding for image manipulation localization, recovery, and image IP protection simultaneously. Extensive experiments demonstrate RecoverMark's robustness against both seen and unseen attacks and its generalizability to in-distribution and out-of-distribution data.

gan diffusion cnn South China University of Technology · Singapore Institute of Technology · City University of Hong Kong

PDF arXiv DOI

defense arXiv Feb 5, 2026 · 8w ago

Surgery: Mitigating Harmful Fine-Tuning for Large Language Models via Attention Sink

Guozhi Liu, Weiwei Lin, Tiansheng Huang et al. · South China University of Technology · Pengcheng Laboratory +1 more

Defends LLM safety alignment during fine-tuning by regularizing attention sink divergence to prevent harmful pattern learning

Transfer Learning Attack nlp

PDF Code

defense arXiv Feb 3, 2026 · 8w ago

Risk Awareness Injection: Calibrating Vision-Language Models for Safety without Compromising Utility

Mengxuan Wang, Yuxin Chen, Gang Xu et al. · South China University of Technology · Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) +2 more

Training-free VLM defense that amplifies risk signals in visual tokens to block multimodal jailbreak attacks without utility loss

Input Manipulation Attack Prompt Injection vision nlp multimodal

PDF

defense arXiv Dec 5, 2025 · Dec 2025

ARGUS: Defending Against Multimodal Indirect Prompt Injection via Steering Instruction-Following Behavior

Weikai Lu, Ziqian Zeng, Kehua Zhang et al. · South China University of Technology · Hong Kong University of Science and Technology +2 more

Defends MLLMs against multimodal indirect prompt injection by steering instruction-following behavior in activation space

Prompt Injection multimodal nlp

1 citation PDF

benchmark arXiv Nov 24, 2025 · Nov 2025

DiffSeg30k: A Multi-Turn Diffusion Editing Benchmark for Localized AIGC Detection

Hai Ci, Ziheng Peng, Pei Yang et al. · National University of Singapore · South China University of Technology

Benchmark dataset of 30k diffusion-edited images with pixel-level annotations for localizing AI edits via semantic segmentation

Output Integrity Attack vision generative

PDF Code

attack arXiv Nov 20, 2025 · Nov 2025

Physically Realistic Sequence-Level Adversarial Clothing for Robust Human-Detection Evasion

Dingkun Zhou, Patrick P. K. Chan, Hengxu Wu et al. · South China University of Technology · Tsinghua University

Sequence-level adversarial clothing textures with UV parameterization and temporal EoT that physically evade human detection in video

Input Manipulation Attack vision

PDF

benchmark arXiv Nov 18, 2025 · Nov 2025

N-GLARE: An Non-Generative Latent Representation-Efficient LLM Safety Evaluator

Zheyu Lin, Jirui Yang, Yukui Qiu et al. · University of California · Fudan University +1 more

Proposes latent-trajectory metric to benchmark LLM jailbreak robustness without text generation, matching red-teaming rankings at under 1% compute cost

Prompt Injection nlp

PDF

attack Pattern Recognition Nov 3, 2025 · Nov 2025

Beyond Deceptive Flatness: Dual-Order Solution for Strengthening Adversarial Transferability

Zhixuan Zhang, Pingyu Wang, Xingjian Zheng et al. · Sichuan University · Frost Drill Intellectual Software Pte. Ltd +1 more

Black-box transferable adversarial attack using dual-order flatness to escape deceptive loss regions and boost cross-model transferability

Input Manipulation Attack vision

PDF

defense arXiv Oct 18, 2025 · Oct 2025

EDVD-LLaMA: Explainable Deepfake Video Detection via Multimodal Large Language Model Reasoning

Haoran Sun, Chen Cai, Huiping Zhuang et al. · The Hong Kong Polytechnic University · Nanyang Technological University +1 more

Explainable deepfake video detector using multimodal LLaMA with spatio-temporal chain-of-thought reasoning and facial hard constraints

Output Integrity Attack vision multimodal nlp

PDF Code

The rapid development of deepfake video technology has not only facilitated artistic creation but also made it easier to spread misinformation. Traditional deepfake video detection (DVD) methods face issues such as a lack of transparency in their principles and insufficient generalization capabilities to cope with evolving forgery techniques. This highlights an urgent need for detectors that can identify forged content and provide verifiable reasoning explanations. This paper proposes the explainable deepfake video detection (EDVD) task and designs EDVD-LLaMA, a multimodal large language model (MLLM) reasoning framework that provides traceable reasoning processes alongside accurate detection results and trustworthy explanations. Our approach first incorporates a Spatio-Temporal Subtle Information Tokenization (ST-SIT) to extract and fuse global and local cross-frame deepfake features, providing rich spatio-temporal semantic information as input for MLLM reasoning. Second, we construct a Fine-grained Multimodal Chain-of-Thought (Fg-MCoT) mechanism, which introduces facial feature data as hard constraints during the reasoning process to achieve pixel-level spatio-temporal video localization, suppress hallucinated outputs, and enhance the reliability of the chain of thought. In addition, we build an Explainable Reasoning FF++ dataset (ER-FF++set), leveraging structured data to annotate videos and ensure quality control, thereby supporting dual supervision for reasoning and detection. Extensive experiments demonstrate that EDVD-LLaMA achieves outstanding performance and robustness in terms of detection accuracy, explainability, and its ability to handle cross-forgery methods and cross-dataset scenarios. Compared to previous DVD methods, it provides a more explainable and superior solution. The project page is available at: https://11ouo1.github.io/edvd-llama/.

vlm llm transformer The Hong Kong Polytechnic University · Nanyang Technological University · South China University of Technology

PDF arXiv DOI Code

attack arXiv Oct 18, 2025 · Oct 2025

Noise Aggregation Analysis Driven by Small-Noise Injection: Efficient Membership Inference for Diffusion Models

Guo Li, Yuyang Yu, Xuemiao Xu · South China University of Technology

Membership inference attack on diffusion models exploiting noise aggregation patterns after small-noise injection, requiring fewer model queries

Membership Inference Attack vision generative

PDF

defense arXiv Oct 11, 2025 · Oct 2025

Pharmacist: Safety Alignment Data Curation for Large Language Models against Harmful Fine-tuning

Guozhi Liu, Qi Mu, Tiansheng Huang et al. · South China University of Technology · Ltd. +4 more

Curates safety-critical alignment data subsets to harden LLMs against harmful fine-tuning attacks while cutting training time by ~57%

Transfer Learning Attack Prompt Injection nlp

2 citations 1 influential PDF Code

defense arXiv Oct 9, 2025 · Oct 2025

Physics-Driven Spatiotemporal Modeling for AI-Generated Video Detection

Shuhai Zhang, ZiHao Lian, Jiahao Yang et al. · South China University of Technology · Pazhou Lab +4 more

Detects AI-generated videos via physics-driven NSG statistic quantifying violations of probability flow conservation laws

Output Integrity Attack vision generative

6 citations 1 influential PDF Code

tool arXiv Oct 3, 2025 · Oct 2025

UniShield: An Adaptive Multi-Agent Framework for Unified Forgery Image Detection and Localization

Qing Huang, Zhipei Xu, Xuanyu Zhang et al. · Peking University · South China University of Technology

Multi-agent system that unifies deepfake, AI-image, and manipulation detection by dynamically routing to expert detectors

Output Integrity Attack vision multimodal

PDF

defense arXiv Sep 27, 2025 · Sep 2025

CATMark: A Context-Aware Thresholding Framework for Robust Cross-Task Watermarking in Large Language Models

Yu Zhang, Shuliang Liu, Xu Yang et al. · The Hong Kong University of Science and Technology (Guangzhou) · South China University of Technology

Proposes dynamic LLM text watermarking using context-aware entropy thresholds to preserve quality across mixed-modality generation tasks

Output Integrity Attack nlp

1 citation PDF

defense arXiv Sep 22, 2025 · Sep 2025

StableGuard: Towards Unified Copyright Protection and Tamper Localization in Latent Diffusion Models

Haoxin Yang, Bangzhen Liu, Xuemiao Xu et al. · South China University of Technology · Singapore Management University +1 more

Embeds binary watermarks into diffusion model outputs for copyright protection and tampered-region localization via end-to-end VAE-forensic network co-training

Output Integrity Attack vision generative

1 citation PDF

attack arXiv Aug 1, 2025 · Aug 2025

Activation-Guided Local Editing for Jailbreaking Attacks

Jiecong Wang, Haoran Li, Hao Peng et al. · Beihang University · The Hong Kong University of Science and Technology +3 more

Two-stage LLM jailbreak uses hidden-state activations to guide text-level edits, bypassing safety alignment with SOTA attack success rates

Prompt Injection nlp

PDF Code
