ML Security Papers

Latest papers

10 papers

tool arXiv Mar 18, 2026 · 19d ago

EvoGuard: An Extensible Agentic RL-based Framework for Practical and Evolving AI-Generated Image Detection

Chenyang Zhu, Maorong Wang, Jun Liu et al. · The University of Tokyo · National Institute of Informatics

Agentic framework orchestrating multiple AIGI detectors via reinforcement learning for extensible, train-free AI-generated image detection

Output Integrity Attack visionmultimodalnlp

PDF

defense arXiv Mar 6, 2026 · 4w ago

Safe Transformer: An Explicit Safety Bit For Interpretable And Controllable Alignment

Jingyuan Feng, Andrew Gambardella, Gouki Minegishi et al. · The University of Tokyo

Defends LLMs against jailbreaks via an explicit safety bit that makes alignment interpretable and overridable, achieving near-zero ASR

Prompt Injection nlp

PDF

attack arXiv Feb 28, 2026 · 5w ago

Mathematical Foundations of Poisoning Attacks on Linear Regression over Cumulative Distribution Functions

Atsuki Sato, Martin Aumüller, Yusuke Matsui · The University of Tokyo · IT University of Copenhagen

Proves optimal poisoning attacks on linear regression CDF models, showing greedy multi-point attacks are suboptimal and bounding their maximum impact

Data Poisoning Attack tabular

PDF

attack arXiv Jan 17, 2026 · 11w ago

Gradient Structure Estimation under Label-Only Oracles via Spectral Sensitivity

Jun Liu, Leo Yu Zhang, Fengpeng Li et al. · University of Macau · National Institute of Informatics +2 more

Hard-label black-box adversarial attack using frequency-domain initialization and pattern-driven optimization to recover gradient sign information

Input Manipulation Attack vision

PDF Code

Hard-label black-box settings, where only top-1 predicted labels are observable, pose a fundamentally constrained yet practically important feedback model for understanding model behavior. A central challenge in this regime is whether meaningful gradient information can be recovered from such discrete responses. In this work, we develop a unified theoretical perspective showing that a wide range of existing sign-flipping hard-label attacks can be interpreted as implicitly approximating the sign of the true loss gradient. This observation reframes hard-label attacks from heuristic search procedures into instances of gradient sign recovery under extremely limited feedback. Motivated by this first-principles understanding, we propose a new attack framework that combines a zero-query frequency-domain initialization with a Pattern-Driven Optimization (PDO) strategy. We establish theoretical guarantees demonstrating that, under mild assumptions, our initialization achieves higher expected cosine similarity to the true gradient sign compared to random baselines, while the proposed PDO procedure attains substantially lower query complexity than existing structured search approaches. We empirically validate our framework through extensive experiments on CIFAR-10, ImageNet, and ObjectNet, covering standard and adversarially trained models, commercial APIs, and CLIP-based models. The results show that our method consistently surpasses SOTA hard-label attacks in both attack success rate and query efficiency, particularly in low-query regimes. Beyond image classification, our approach generalizes effectively to corrupted data, biomedical datasets, and dense prediction tasks. Notably, it also successfully circumvents Blacklight, a SOTA stateful defense, resulting in a $0\%$ detection rate. Our code will be released publicly soon at https://github.com/csjunjun/DPAttack.git.

cnn transformer University of Macau · National Institute of Informatics · Griffith University +1 more

PDF arXiv DOI Code

defense arXiv Jan 5, 2026 · Jan 2026

ExposeAnyone: Personalized Audio-to-Expression Diffusion Models Are Robust Zero-Shot Face Forgery Detectors

Kaede Shiohara, Toshihiko Yamasaki, Vladislav Golyanik · The University of Tokyo · Max Planck Institute for Informatics

Self-supervised deepfake detector using personalized audio-to-expression diffusion models to catch unseen face forgeries zero-shot

Output Integrity Attack visionaudiomultimodalgenerative

PDF

benchmark arXiv Jan 4, 2026 · Jan 2026

JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models

Junyu Liu, Zirui Li, Qian Niu et al. · Kyoto University · Hohai University +3 more

Benchmarks 27 LLMs against 50K+ multi-turn medical jailbreak conversations in Japanese, finding fine-tuned medical models are most vulnerable

Prompt Injection nlp

PDF

defense arXiv Dec 17, 2025 · Dec 2025

SGM: Safety Glasses for Multimodal Large Language Models via Neuron-Level Detoxification

Hongbo Wang, MaungMaung AprilPyone, Isao Echizen · The University of Tokyo · National Institute of Informatics +1 more

Neuron-level white-box defense suppresses toxic expert neurons in VLMs, cutting harmful outputs from 48% to 2.5% under adversarial jailbreaks

Prompt Injection nlpmultimodalvision

1 citations PDF Code

defense arXiv Oct 16, 2025 · Oct 2025

Active Honeypot Guardrail System: Probing and Confirming Multi-Turn LLM Jailbreaks

ChenYu Wu, Yi Wang, Yang Liao · The University of Tokyo · Xi’an Jiaotong University

Proactive honeypot defense uses a fine-tuned bait model to lure multi-turn LLM jailbreak attackers into revealing malicious intent

Prompt Injection nlp

PDF

defense Asia-Pacific Signal and Inform... Oct 10, 2025 · Oct 2025

Uncolorable Examples: Preventing Unauthorized AI Colorization via Perception-Aware Chroma-Restrictive Perturbation

Yuki Nii, Futa Waseda, Ching-Chun Chang et al. · The University of Tokyo · National Institute of Informatics

Adversarial perturbations embedded in grayscale images to disrupt AI colorization models and prevent unauthorized copyright infringement

Output Integrity Attack visiongenerative

PDF

benchmark arXiv Oct 1, 2025 · Oct 2025

Understanding Sensitivity of Differential Attention through the Lens of Adversarial Robustness

Tsubasa Takahashi, Shojiro Yamabe, Futa Waseda et al. · Turing Inc. · Institute of Science Tokyo +2 more

Reveals Differential Attention transformers are structurally more fragile to adversarial perturbations than standard attention via negative gradient alignment theory

Input Manipulation Attack visionmultimodal

PDF

Latest papers

EvoGuard: An Extensible Agentic RL-based Framework for Practical and Evolving AI-Generated Image Detection

Safe Transformer: An Explicit Safety Bit For Interpretable And Controllable Alignment

Mathematical Foundations of Poisoning Attacks on Linear Regression over Cumulative Distribution Functions

Gradient Structure Estimation under Label-Only Oracles via Spectral Sensitivity

ExposeAnyone: Personalized Audio-to-Expression Diffusion Models Are Robust Zero-Shot Face Forgery Detectors

JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models

SGM: Safety Glasses for Multimodal Large Language Models via Neuron-Level Detoxification

Active Honeypot Guardrail System: Probing and Confirming Multi-Turn LLM Jailbreaks

Uncolorable Examples: Preventing Unauthorized AI Colorization via Perception-Aware Chroma-Restrictive Perturbation

Understanding Sensitivity of Differential Attention through the Lens of Adversarial Robustness

Filters

Time Period

Paper Type

OWASP ML Top 10

OWASP LLM Top 10

Institution

Venue