Latest papers

20 papers
attack arXiv Mar 31, 2026 · 8d ago

Dummy-Aware Weighted Attack (DAWA): Breaking the Safe Sink in Dummy Class Defenses

Yunrui Yu, Xuxiang Feng, Pengda Qin et al. · Tsinghua University · University of Macau +1 more

Novel adversarial attack targeting dummy-class defenses by simultaneously attacking true and dummy labels with adaptive weighting

Input Manipulation Attack vision
PDF
attack arXiv Mar 26, 2026 · 13d ago

A Unified Spatial Alignment Framework for Highly Transferable Transformation-Based Attacks on Spatially Structured Tasks

Jiaming Liang, Chi-Man Pun · University of Macau

Spatial transformation-based adversarial attacks on segmentation and detection models via synchronized label-input alignment

Input Manipulation Attack vision
PDF
defense arXiv Mar 26, 2026 · 13d ago

Efficient Preemptive Robustification with Image Sharpening

Jiaming Liang, Chi-Man Pun · University of Macau

Image sharpening as a simple, efficient preemptive defense that robustifies benign images against adversarial perturbations before attacks occur

Input Manipulation Attack vision
PDF
defense arXiv Mar 25, 2026 · 14d ago

High-Fidelity Face Content Recovery via Tamper-Resilient Versatile Watermarking

Peipeng Yu, Jinfeng Xie, Chengfu Ou et al. · Jinan University · University of Macau +2 more

Embeds semantic watermarks in face images for copyright protection, pixel-level deepfake localization, and content recovery after manipulation

Output Integrity Attack vision generative
PDF
tool arXiv Mar 23, 2026 · 16d ago

FeatDistill: A Feature Distillation Enhanced Multi-Expert Ensemble Framework for Robust AI-generated Image Detection

Zhilin Tu, Kemou Li, Fengpeng Li et al. · University of Electronic Science and Technology of China · University of Macau +2 more

Multi-expert ensemble detector for AI-generated images robust to degradations, using CLIP/SigLIP transformers with feature distillation

Output Integrity Attack vision generative
PDF
defense arXiv Mar 2, 2026 · 5w ago

Process Over Outcome: Cultivating Forensic Reasoning for Generalizable Multimodal Manipulation Detection

Yuchen Zhang, Yaxiong Wang, Kecheng Han et al. · Xi’an Jiaotong University · Hefei University of Technology +3 more

Proposes REFORM, a forensic-reasoning framework with curriculum learning and RL to generalize multimodal deepfake detection

Output Integrity Attack multimodal vision nlp generative
PDF
defense arXiv Feb 6, 2026 · 8w ago

AEGIS: Adversarial Target-Guided Retention-Data-Free Robust Concept Erasure from Diffusion Models

Fengpeng Li, Kemou Li, Qizhou Wang et al. · University of Macau · King Abdullah University of Science and Technology +2 more

Defends diffusion model concept erasure against adversarial prompt reactivation attacks via semantic-center-targeting adversarial erasure targets and gradient projection

Input Manipulation Attack vision generative
PDF Code
defense arXiv Feb 4, 2026 · 9w ago

SIDeR: Semantic Identity Decoupling for Unrestricted Face Privacy

Zhuosen Bao, Xia Du, Zheng Lin et al. · Xiamen University of Technology · University of Hong Kong +8 more

Generates unrestricted adversarial faces using diffusion models to evade facial recognition with 99% black-box success rate

Input Manipulation Attack vision generative
PDF
defense arXiv Jan 28, 2026 · 10w ago

MARE: Multimodal Alignment and Reinforcement for Explainable Deepfake Detection via Vision-Language Models

Wenbo Xu, Wei Lu, Xiangyang Luo et al. · Sun Yat-Sen University · State Key Laboratory of Mathematical Engineering and Advanced Computing +1 more

Proposes VLM-based deepfake detector using RLHF and multimodal alignment rewards for explainable forgery reasoning and spatial localization

Output Integrity Attack vision multimodal
PDF
benchmark arXiv Jan 24, 2026 · 10w ago

OTI: A Model-free and Visually Interpretable Measure of Image Attackability

Jiaming Liang, Haowei Liu, Chi-Man Pun · University of Macau · Chongqing University of Posts and Telecommunications

Proposes OTI, a model-free texture-based metric for quantifying per-image adversarial vulnerability without model access

Input Manipulation Attack vision
PDF Code
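
The exact OTI formulation is defined in the paper; as a loose illustration of what a "model-free, texture-based" per-image statistic can look like (no classifier access, only image gradients), here is a hypothetical sketch. The function `texture_score` and its patch-variance construction are my own illustration, not the paper's metric.

```python
import numpy as np

def texture_score(img: np.ndarray, patch: int = 8) -> float:
    """Loose, model-free texture statistic for a grayscale image in [0, 1].

    Measures how unevenly local gradient energy is distributed across
    non-overlapping patches. The real OTI metric differs; this only
    shows that such a score needs no model queries.
    """
    gy, gx = np.gradient(img.astype(np.float64))
    energy = gx ** 2 + gy ** 2
    h, w = energy.shape
    h, w = h - h % patch, w - w % patch
    blocks = energy[:h, :w].reshape(h // patch, patch, w // patch, patch)
    return float(blocks.mean(axis=(1, 3)).std())  # spread of per-patch texture

rng = np.random.default_rng(0)
flat = np.full((64, 64), 0.5)    # untextured image: score is 0
noisy = rng.random((64, 64))     # highly textured image: higher score
assert texture_score(flat) < texture_score(noisy)
```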
attack arXiv Jan 17, 2026 · 11w ago

Gradient Structure Estimation under Label-Only Oracles via Spectral Sensitivity

Jun Liu, Leo Yu Zhang, Fengpeng Li et al. · University of Macau · National Institute of Informatics +2 more

Hard-label black-box adversarial attack using frequency-domain initialization and pattern-driven optimization to recover gradient sign information

Input Manipulation Attack vision
PDF Code
defense arXiv Jan 12, 2026 · 12w ago

Universal Adversarial Purification with DDIM Metric Loss for Stable Diffusion

Li Zheng, Liangbin Xie, Jiantao Zhou et al. · University of Macau · Shenzhen Institute of Advanced Technology

Defeats anti-fine-tuning image protections on Stable Diffusion by minimizing DDIM inversion reconstruction error to purify adversarial noise

Output Integrity Attack vision generative
PDF Code
defense arXiv Jan 3, 2026 · Jan 2026

IO-RAE: Information-Obfuscation Reversible Adversarial Example for Audio Privacy Protection

Jiajie Zhu, Xia Du, Xiaoyuan Liu et al. · Xiamen University of Technology · Sichuan University +2 more

Reversible adversarial audio perturbations fool ASR systems into wrong transcriptions while authorized parties recover the original audio losslessly

Input Manipulation Attack audio
PDF
defense arXiv Nov 20, 2025 · Nov 2025

How Noise Benefits AI-generated Image Detection

Jiazhen Yan, Ziqiang Li, Fan Wang et al. · Nanjing University of Information Science and Technology · University of Macau +1 more

Proposes PiN-CLIP, a noise-guided CLIP fine-tuning method that suppresses spurious shortcuts for generalizable AI-generated image detection

Output Integrity Attack vision generative
PDF
defense arXiv Nov 17, 2025 · Nov 2025

DGS-Net: Distillation-Guided Gradient Surgery for CLIP Fine-Tuning in AI-Generated Image Detection

Jiazhen Yan, Ziqiang Li, Fan Wang et al. · Nanjing University of Information Science and Technology · University of Macau

Novel gradient surgery framework fine-tunes CLIP for AI-generated image detection while preventing catastrophic forgetting

Output Integrity Attack vision multimodal
PDF
attack arXiv Nov 15, 2025 · Nov 2025

Dynamic Parameter Optimization for Highly Transferable Transformation-Based Attacks

Jiaming Liang, Chi-Man Pun · University of Macau

Improves black-box adversarial transferability via dynamic parameter optimization, cutting grid-search complexity from O(mn) to O(n log m)

Input Manipulation Attack vision
PDF
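
The paper's own optimization procedure is not reproduced here, but the O(mn) → O(n log m) claim has a standard shape: if the response to each of the n transformation parameters is (assumed) unimodal, a ternary search over an m-point grid needs O(log m) evaluations per parameter instead of m. A minimal sketch with a toy surrogate objective:

```python
# Hypothetical illustration of replacing per-parameter grid search with
# ternary search; the surrogate objective and grid are made up.

def ternary_search(f, grid):
    """Maximize f over a grid assuming f is unimodal on it."""
    lo, hi, evals = 0, len(grid) - 1, 0
    while hi - lo > 2:
        m1 = lo + (hi - lo) // 3
        m2 = hi - (hi - lo) // 3
        evals += 2
        if f(grid[m1]) < f(grid[m2]):
            lo = m1 + 1            # maximum lies to the right of m1
        else:
            hi = m2                # maximum lies at or left of m2
    best = max(grid[lo:hi + 1], key=f)
    return best, evals

grid = list(range(100))            # m = 100 candidate parameter values
score = lambda x: -(x - 37) ** 2   # toy unimodal transferability surrogate
best, evals = ternary_search(score, grid)
assert best == 37 and evals < len(grid)   # far fewer than m evaluations
```

Exhaustive grid search over n parameters costs m evaluations each (O(mn) total); doing this per parameter gives the O(n log m) total the summary cites, under the unimodality assumption.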
attack EMNLP Oct 11, 2025 · Oct 2025

Path Drift in Large Reasoning Models: How First-Person Commitments Override Safety

Yuyi Huang, Runzhe Zhan, Lidia S. Chao et al. · Guangzhou Medical University · University of Macau

Identifies 'Path Drift' jailbreak in chain-of-thought LLMs via first-person priming, ethical evaporation, and condition chaining to bypass RLHF safety

Prompt Injection nlp
2 citations PDF
attack EMNLP Sep 23, 2025 · Sep 2025

The Ranking Blind Spot: Decision Hijacking in LLM-based Text Ranking

Yaoyao Qian, Yifan Zeng, Yuchao Jiang et al. · Northeastern University · Oregon State University +1 more

Attacks LLM-based document rankers via content injection that hijacks evaluation objectives or relevance criteria, boosting attacker documents to top positions

Prompt Injection nlp
1 citation 1 influential PDF Code
defense arXiv Aug 18, 2025 · Aug 2025

RepreGuard: Detecting LLM-Generated Text by Revealing Hidden Representation Patterns

Xin Chen, Junchao Wu, Shu Yang et al. · University of Macau · Chinese Academy of Sciences +2 more

Proposes RepreGuard, detecting LLM-generated text via hidden activation patterns for robust OOD detection at 94.92% AUROC

Output Integrity Attack nlp
PDF Code
defense arXiv Aug 2, 2025 · Aug 2025

NS-Net: Decoupling CLIP Semantic Information through NULL-Space for Generalizable AI-Generated Image Detection

Jiazhen Yan, Fan Wang, Weiwei Jiang et al. · Nanjing University of Information Science and Technology · University of Macau

Proposes NULL-Space projection on CLIP features to remove semantic bias, improving generalized AI-generated image detection by 7.4%

Output Integrity Attack vision generative
PDF
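
NS-Net's exact construction is in the paper; the underlying linear-algebra idea of a null-space projection can be sketched generically. Assuming a matrix `A` whose columns span estimated "semantic" directions in a CLIP-sized feature space (both `A` and the 512-dim size are illustrative, not the paper's), projecting onto the orthogonal complement removes all components along those directions:

```python
import numpy as np

def nullspace_projector(A: np.ndarray) -> np.ndarray:
    """P = I - A A^+, the orthogonal projector onto the null space of A^T.

    Any vector multiplied by P has zero component along the columns of A.
    """
    d = A.shape[0]
    return np.eye(d) - A @ np.linalg.pinv(A)

rng = np.random.default_rng(0)
A = rng.standard_normal((512, 8))   # 8 hypothetical semantic directions
P = nullspace_projector(A)
f = rng.standard_normal(512)        # a feature vector to "de-semanticize"
g = P @ f
assert np.allclose(A.T @ g, 0, atol=1e-6)  # no remaining semantic component
```

The design choice here is the textbook projector `I - A A^+`; whatever subspace NS-Net actually estimates, the projection step removes exactly the span of the chosen directions while leaving the orthogonal (semantics-agnostic) part of the feature untouched.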