ML Security Papers

Latest papers

12 papers

defense arXiv Mar 26, 2026 · 11d ago

SAVe: Self-Supervised Audio-visual Deepfake Detection Exploiting Visual Artifacts and Audio-visual Misalignment

Sahibzada Adil Shahzad, Ammarah Hashmi, Junichi Yamagishi et al. · National Institute of Informatics · Academia Sinica +2 more

Self-supervised multimodal deepfake detector trained on real videos, detecting visual tampering artifacts and audio-visual lip-sync inconsistencies

Output Integrity Attack multimodal vision audio

PDF

tool arXiv Mar 18, 2026 · 19d ago

EvoGuard: An Extensible Agentic RL-based Framework for Practical and Evolving AI-Generated Image Detection

Chenyang Zhu, Maorong Wang, Jun Liu et al. · The University of Tokyo · National Institute of Informatics

Agentic framework orchestrating multiple AIGI detectors via reinforcement learning for extensible, training-free AI-generated image detection

Output Integrity Attack vision multimodal nlp

PDF

defense arXiv Feb 26, 2026 · 5w ago

Deepfake Word Detection by Next-token Prediction using Fine-tuned Whisper

Hoan My Tran, Xin Wang, Wanying Ge et al. · Université de Rennes · National Institute of Informatics

Fine-tunes Whisper to detect synthetic deepfake words in audio via next-token prediction with special boundary tokens

Output Integrity Attack audio

PDF

attack arXiv Jan 28, 2026 · 9w ago

Self Voice Conversion as an Attack against Neural Audio Watermarking

Yigitcan Özer, Wanying Ge, Zhe Zhang et al. · National Institute of Informatics

Attacks audio watermarks by passing speech through self voice conversion, stripping embedded marks while preserving speaker identity and content

Output Integrity Attack audio

1 citation PDF

attack arXiv Jan 17, 2026 · 11w ago

Gradient Structure Estimation under Label-Only Oracles via Spectral Sensitivity

Jun Liu, Leo Yu Zhang, Fengpeng Li et al. · University of Macau · National Institute of Informatics +2 more

Hard-label black-box adversarial attack using frequency-domain initialization and pattern-driven optimization to recover gradient sign information

Input Manipulation Attack vision

PDF Code

Hard-label black-box settings, where only top-1 predicted labels are observable, pose a fundamentally constrained yet practically important feedback model for understanding model behavior. A central challenge in this regime is whether meaningful gradient information can be recovered from such discrete responses. In this work, we develop a unified theoretical perspective showing that a wide range of existing sign-flipping hard-label attacks can be interpreted as implicitly approximating the sign of the true loss gradient. This observation reframes hard-label attacks from heuristic search procedures into instances of gradient sign recovery under extremely limited feedback. Motivated by this first-principles understanding, we propose a new attack framework that combines a zero-query frequency-domain initialization with a Pattern-Driven Optimization (PDO) strategy. We establish theoretical guarantees demonstrating that, under mild assumptions, our initialization achieves higher expected cosine similarity to the true gradient sign compared to random baselines, while the proposed PDO procedure attains substantially lower query complexity than existing structured search approaches. We empirically validate our framework through extensive experiments on CIFAR-10, ImageNet, and ObjectNet, covering standard and adversarially trained models, commercial APIs, and CLIP-based models. The results show that our method consistently surpasses SOTA hard-label attacks in both attack success rate and query efficiency, particularly in low-query regimes. Beyond image classification, our approach generalizes effectively to corrupted data, biomedical datasets, and dense prediction tasks. Notably, it also successfully circumvents Blacklight, a SOTA stateful defense, resulting in a $0\%$ detection rate. Our code will be released publicly soon at https://github.com/csjunjun/DPAttack.git.
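As a rough illustration of the "gradient sign recovery" reading described in the abstract above, the following toy sketch runs a greedy sign-flipping search against a label-only oracle and compares the result with the true gradient sign. It is not the paper's frequency-domain initialization or its PDO procedure. The linear model, the loss (negative class-1 score), and every function name here are assumptions made for illustration only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))

def make_oracle(w, b):
    """Hypothetical hard-label oracle: exposes only the top-1 label of a toy linear model."""
    def oracle(x):
        return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
    return oracle

def boundary_radius(x, d, label, oracle, r_max=64.0, steps=40):
    """Smallest step along d that changes the label, found by binary search on labels only."""
    step = lambda r: [xi + r * di for xi, di in zip(x, d)]
    if oracle(step(r_max)) == label:
        return math.inf  # the label never changes along this direction
    lo, hi = 0.0, r_max
    for _ in range(steps):
        mid = (lo + hi) / 2
        if oracle(step(mid)) == label:
            lo = mid
        else:
            hi = mid
    return hi

def sign_flip_search(x, oracle, passes=2):
    """Greedy coordinate-wise sign flipping: keep a flip only if it shrinks the boundary radius."""
    label = oracle(x)
    d = [1.0] * len(x)
    best = boundary_radius(x, d, label, oracle)
    for _ in range(passes):
        for i in range(len(d)):
            d[i] = -d[i]
            r = boundary_radius(x, d, label, oracle)
            if r < best:
                best = r          # accept the flip
            else:
                d[i] = -d[i]      # revert the flip
    return d

# Toy setup (all values are illustrative assumptions).
w, b = [0.5, -1.0, 2.0, -0.25], 0.1
x = [1.0, -1.0, 1.0, -1.0]                            # classified as label 1
true_sign = [-1.0 if wi > 0 else 1.0 for wi in w]     # sign of d(-score)/dx = -sign(w)

d_final = sign_flip_search(x, make_oracle(w, b))
similarity = cosine(d_final, true_sign)
```

In this toy linear case the accepted flips are exactly those that align the search direction with the loss gradient sign, which is the intuition the abstract formalizes. Real models are nonlinear and query budgets matter, so this says nothing about the paper's reported success rates or query counts.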

cnn transformer University of Macau · National Institute of Informatics · Griffith University +1 more

PDF arXiv DOI Code

defense arXiv Dec 17, 2025 · Dec 2025

SGM: Safety Glasses for Multimodal Large Language Models via Neuron-Level Detoxification

Hongbo Wang, MaungMaung AprilPyone, Isao Echizen · The University of Tokyo · National Institute of Informatics +1 more

Neuron-level white-box defense suppresses toxic expert neurons in VLMs, cutting harmful outputs from 48% to 2.5% under adversarial jailbreaks

Prompt Injection nlp multimodal vision

1 citation PDF Code

attack arXiv Oct 30, 2025 · Oct 2025

FGGM: Formal Grey-box Gradient Method for Attacking DRL-based MU-MIMO Scheduler

Thanh Le, Hai Duong, Yusheng Ji et al. · The Graduate University for Advanced Studies · National Institute of Informatics +2 more

Grey-box attack on DRL-based 5G schedulers uses polytope abstract domains to craft adversarial CSI inputs degrading victim throughput by 70%

Input Manipulation Attack reinforcement-learning

1 citation PDF

defense Asia-Pacific Signal and Inform... Oct 10, 2025 · Oct 2025

Uncolorable Examples: Preventing Unauthorized AI Colorization via Perception-Aware Chroma-Restrictive Perturbation

Yuki Nii, Futa Waseda, Ching-Chun Chang et al. · The University of Tokyo · National Institute of Informatics

Embeds adversarial perturbations in grayscale images to disrupt AI colorization models and prevent unauthorized colorization

Output Integrity Attack vision generative

PDF

defense arXiv Oct 6, 2025 · Oct 2025

WaveSP-Net: Learnable Wavelet-Domain Sparse Prompt Tuning for Speech Deepfake Detection

Xi Xuan, Xuechen Liu, Wenxin Zhang et al. · University of Eastern Finland · National Institute of Informatics +4 more

Novel wavelet prompt-tuning architecture for speech deepfake detection, outperforming SOTA on two benchmarks with far fewer trainable parameters

Output Integrity Attack audio

1 citation PDF Code

defense arXiv Sep 24, 2025 · Sep 2025

ThinkFake: Reasoning in Multimodal Large Language Models for AI-Generated Image Detection

Tai-Ming Huang, Wei-Tung Lin, Kai-Lung Hua et al. · National Taiwan University · Academia Sinica +3 more

Detects AI-generated images via MLLM step-by-step reasoning trained with GRPO reinforcement learning, achieving strong zero-shot generalization

Output Integrity Attack vision multimodal

3 citations 1 influential PDF

defense arXiv Sep 22, 2025 · Sep 2025

Distributionally Robust Safety Verification of Neural Networks via Worst-Case CVaR

Masako Kishida · National Institute of Informatics

Extends SDP-based neural network verification with worst-case CVaR to certify safety under distributional input uncertainty and tail risk

Input Manipulation Attack

PDF

attack arXiv Jan 4, 2025 · Jan 2025

BADTV: Unveiling Backdoor Threats in Third-Party Task Vectors

Chia-Yi Hsu, Yu-Lin Tsai, Yu Zhe et al. · National Yang Ming Chiao Tung University · University of Tsukuba +2 more

Backdoor attack on task vectors that persists across task learning, forgetting, and analogy arithmetic operations, evading all tested defenses

Model Poisoning · Transfer Learning Attack vision nlp multimodal

2 citations PDF
