ML Security Papers

Latest papers

10 papers

defense arXiv Mar 6, 2026 · 4w ago

BlackMirror: Black-Box Backdoor Detection for Text-to-Image Models via Instruction-Response Deviation

Feiran Li, Qianqian Xu, Shilong Bao et al. · Institute of Information Engineering · University of Chinese Academy of Sciences +4 more

Black-box backdoor detector for text-to-image diffusion models using semantic instruction-response deviation across varied prompts

Model Poisoning vision generative multimodal

PDF Code

defense arXiv Mar 3, 2026 · 4w ago

From Shallow to Deep: Pinning Semantic Intent via Causal GRPO

Shuyi Zhou, Zeen Song, Wenwen Qiang et al. · University of Chinese Academy of Sciences · Institute of Information Engineering +1 more

Defends LLMs against adversarial prefix jailbreaks by causal probing to pin malicious intent across autoregressive generation

Prompt Injection nlp

PDF

attack arXiv Feb 7, 2026 · 8w ago

Reverse-Engineering Model Editing on Language Models

Zhiyu Sun, Minrui Luo, Yu Wang et al. · Shanghai Qi Zhi Institute · East China Normal University +3 more

Recovers private edited data from LLM parameter update matrices using spectral analysis and entropy-based prompt reconstruction

Model Inversion Attack Sensitive Information Disclosure nlp

PDF Code

defense arXiv Feb 2, 2026 · 9w ago

Backdoor Sentinel: Detecting and Detoxifying Backdoors in Diffusion Models via Temporal Noise Consistency

Bingzheng Wang, Xiaoyan Gu, Hongbo Xu et al. · Institute of Information Engineering

Detects and detoxifies backdoors in diffusion models by exploiting temporal noise inconsistency patterns introduced by triggers across denoising timesteps

Model Poisoning vision generative

PDF

defense arXiv Feb 2, 2026 · 9w ago

WorldCup Sampling for Multi-bit LLM Watermarking

Yidan Wang, Yubing Ren, Yanan Cao et al. · Institute of Information Engineering · University of Chinese Academy of Sciences

Proposes WorldCup, a multi-bit LLM output watermarking scheme embedding provenance bits directly into token sampling via hierarchical competition

Output Integrity Attack nlp

PDF

defense arXiv Jan 27, 2026 · 9w ago

RvB: Automating AI System Hardening via Iterative Red-Blue Games

Lige Huang, Zicheng Liu, Jie Zhang et al. · Shanghai Artificial Intelligence Laboratory · Institute of Information Engineering +1 more

Automates LLM jailbreak guardrail hardening via iterative red-blue adversarial game without model parameter updates

Prompt Injection nlp

PDF

attack arXiv Nov 3, 2025 · Nov 2025

"Give a Positive Review Only": An Early Investigation Into In-Paper Prompt Injection Attacks and Defenses for AI Reviewers

Qin Zhou, Zhexin Zhang, Zhi Li et al. · Institute of Information Engineering · University of Chinese Academy of Sciences +1 more

Indirect prompt injection hidden inside academic papers hijacks LLM-based AI reviewers into awarding perfect scores

Prompt Injection nlp

1 citation PDF

attack arXiv Sep 14, 2025 · Sep 2025

ENJ: Optimizing Noise with Genetic Algorithms to Jailbreak LSMs

Yibo Zhang, Liang Lin · Beijing University of Posts and Telecommunications · Institute of Information Engineering

Genetic algorithm optimizes real-world background noise into adversarial audio that jailbreaks Audio Large Models with 95% success rate

Input Manipulation Attack Prompt Injection audio nlp

PDF

tool arXiv Sep 8, 2025 · Sep 2025

NeuroDeX: Unlocking Diverse Support in Decompiling Deep Neural Network Executables

Yilin Li, Guozhu Meng, Mingyang Sun et al. · Institute of Information Engineering · University of Chinese Academy of Sciences +1 more

Decompiles on-device DNN executables to recover model architecture and weights, enabling model theft from edge deployments

Model Theft vision

PDF

defense arXiv Aug 3, 2025 · Aug 2025

DiffusionFF: A Diffusion-based Framework for Joint Face Forgery Detection and Fine-Grained Artifact Localization

Siran Peng, Haoyuan Zhang, Li Gao et al. · Institute of Automation · University of Chinese Academy of Sciences +4 more

Diffusion-based encoder-decoder detects face forgeries and localizes artifacts jointly for improved explainability

Output Integrity Attack vision generative

PDF
