Disentangling Speaker Traits for Deepfake Source Verification via Chebyshev Polynomial and Riemannian Metric Learning
Xi Xuan, Wenxin Zhang, Zhiyu Li et al. · University of Eastern Finland · City University of Hong Kong +3 more
Disentangles speaker traits from deepfake source embeddings using Chebyshev polynomials and Riemannian geometry for robust generator verification
Speech deepfake source verification systems aim to determine whether two synthetic speech utterances originate from the same source generator, often assuming that the resulting source embeddings are independent of speaker traits. However, this assumption remains unverified. In this paper, we first investigate the impact of speaker factors on source verification. We then propose a speaker-disentangled metric learning (SDML) framework incorporating two novel loss functions. The first leverages Chebyshev polynomials to mitigate gradient instability during disentanglement optimization. The second projects source and speaker embeddings into hyperbolic space, leveraging Riemannian metric distances to reduce speaker information and learn more discriminative source features. Experimental results on the MLAAD benchmark, evaluated under four newly proposed protocols designed for source-speaker disentanglement scenarios, demonstrate the effectiveness of the SDML framework. The code, evaluation protocols, and demo website are available at https://github.com/xxuan-acoustics/RiemannSD-Net.
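The two losses rest on standard constructions that a short sketch can make concrete: a Chebyshev recurrence for stable polynomial evaluation and the Poincaré-ball (Riemannian) distance. The snippet below is a minimal illustration of those ingredients only; the function names and the unit-ball rescaling trick are assumptions, not the authors' implementation.

```python
import torch

def poincare_distance(u, v, eps=1e-5):
    """Geodesic distance in the Poincare ball:
    d(u, v) = arccosh(1 + 2*|u-v|^2 / ((1 - |u|^2) * (1 - |v|^2)))."""
    sq = (u - v).pow(2).sum(-1)
    denom = (1 - u.pow(2).sum(-1)).clamp_min(eps) * (1 - v.pow(2).sum(-1)).clamp_min(eps)
    return torch.acosh(1 + 2 * sq / denom)

def chebyshev_eval(x, coeffs):
    """Evaluate sum_k c_k * T_k(x) via the recurrence T_{k+1} = 2x*T_k - T_{k-1},
    which stays numerically stable on [-1, 1]. Assumes at least two coefficients."""
    t_prev, t_curr = torch.ones_like(x), x.clone()
    out = coeffs[0] * t_prev + coeffs[1] * t_curr
    for c in coeffs[2:]:
        t_prev, t_curr = t_curr, 2 * x * t_curr - t_prev
        out = out + c * t_curr
    return out

# Embeddings must lie strictly inside the unit ball before poincare_distance applies;
# one common (assumed) rescaling: e = e / (1 + e.norm(dim=-1, keepdim=True)).
```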
Yanming Mu, Hao Hu, Feiyang Li et al. · State Key Laboratory of Mathematical Engineering and Advanced Computing · Information Engineering University +2 more
First end-to-end survey mapping RAG security threats, defenses, and benchmarks across the entire pipeline
Retrieval-Augmented Generation (RAG) significantly mitigates hallucination and domain-knowledge deficiencies in large language models by incorporating external knowledge bases. However, the multi-module architecture of RAG introduces complex system-level security vulnerabilities. Guided by the RAG workflow, this paper analyzes the underlying vulnerability mechanisms and systematically categorizes core threat vectors such as data poisoning, adversarial attacks, and membership inference attacks. Based on this threat assessment, we construct a taxonomy of RAG defense technologies from a dual perspective encompassing both input and output stages. The input-side analysis reviews data protection mechanisms including dynamic access control, homomorphic encryption retrieval, and adversarial pre-filtering. The output-side examination summarizes advanced leakage prevention techniques such as federated learning isolation, differential privacy perturbation, and lightweight data sanitization. To establish a unified benchmark for future experimental design, we consolidate authoritative test datasets, security standards, and evaluation frameworks. To the best of our knowledge, this paper presents the first end-to-end survey dedicated to the security of RAG systems. Distinct from existing literature that isolates specific vulnerabilities, we systematically map the entire pipeline, providing a unified analysis of threat models, defense mechanisms, and evaluation benchmarks. By enabling deep insights into potential risks, this work seeks to foster the development of highly robust and trustworthy next-generation RAG systems.
Men Niu, Xinxin Fan, Quanliang Jing et al. · Institute of Computing Technology · University of Chinese Academy of Sciences +1 more
Introduces three collusive policy-level attacks on cooperative MARL where multiple malicious agents coordinate to disrupt teamwork
Cooperative multi-agent reinforcement learning (c-MARL) has been widely deployed in real-world applications such as social robots, embodied intelligence, and UAV swarms. Nevertheless, many adversarial attacks threaten c-MARL systems. Existing studies mainly focus on single-adversary perturbation attacks and white-box adversarial attacks that manipulate agents' internal observations or actions. To address these limitations, in this paper we study collusive adversarial attacks by strategically organizing a set of malicious agents into three collusive attack modes: Collective Malicious Agents, Disguised Malicious Agents, and Spied Malicious Agents. Three novelties are involved: i) three collusive adversarial attacks are proposed for the first time, and a unified framework, CAMA, for policy-level collusive attacks is designed; ii) the attack effectiveness is theoretically analyzed from the perspectives of disruptiveness, stealthiness, and attack cost; and iii) the three collusive adversarial attacks are technically realized through agents' observation-information fusion and attack-trigger control. Finally, multi-facet experiments on four SMAC II maps are performed, and the results show that the three collusive attacks have an additive adversarial synergy, strengthening attack outcomes while maintaining high stealthiness and stability over long horizons. Our work fills the gap for collusive adversarial learning in c-MARL.
Zhuoshang Wang, Yubing Ren, Yanan Cao et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +1 more
Black-box framework for third-party watermark detection in LLM outputs using proxy models and statistical tests
While watermarking serves as a critical mechanism for LLM provenance, existing secret-key schemes tightly couple detection with injection, requiring access to keys or provider-side scheme-specific detectors for verification. This dependency creates a fundamental barrier for real-world governance, as independent auditing becomes impossible without compromising model security or relying on the opaque claims of service providers. To resolve this dilemma, we introduce TTP-Detect, a pioneering black-box framework designed for non-intrusive, third-party watermark verification. By decoupling detection from injection, TTP-Detect reframes verification as a relative hypothesis testing problem. It employs a proxy model to amplify watermark-relevant signals and a suite of complementary relative measurements to assess the alignment of the query text with watermarked distributions. Extensive experiments across representative watermarking schemes, datasets and models demonstrate that TTP-Detect achieves superior detection performance and robustness against diverse attacks.
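The relative hypothesis-testing framing can be sketched independently of the paper's specific measurements: score the query text with a proxy-model statistic and test it against an empirical null built from known non-watermarked texts. The statistic, threshold, and function names below are illustrative assumptions, not TTP-Detect's actual suite of measurements.

```python
import numpy as np
from scipy import stats

def relative_watermark_test(query_score: float, null_scores: np.ndarray, alpha: float = 0.01) -> bool:
    """One-sided z-test: is the query's proxy score unusually high versus the null?"""
    mu, sigma = null_scores.mean(), null_scores.std(ddof=1)
    z = (query_score - mu) / max(sigma, 1e-8)
    p_value = 1 - stats.norm.cdf(z)
    return p_value < alpha  # True -> flag the text as likely watermarked

# Illustrative usage with synthetic numbers:
rng = np.random.default_rng(0)
null = rng.normal(0.0, 1.0, size=1000)     # proxy scores of known non-watermarked texts
print(relative_watermark_test(3.5, null))  # True
print(relative_watermark_test(0.2, null))  # False
```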
Sen Nie, Jie Zhang, Zhongqi Wang et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences
Freezes pre-trained VLM weights and adapts only shallow layers to achieve adversarial robustness without sacrificing clean accuracy
Achieving adversarial robustness in Vision-Language Models (VLMs) inevitably compromises accuracy on clean data, presenting a long-standing and challenging trade-off. In this work, we revisit this trade-off by investigating a fundamental question: What makes VLMs robust? Through a detailed analysis of adversarially fine-tuned models, we examine how robustness mechanisms function internally and how they interact with clean accuracy. Our analysis reveals that adversarial robustness is not uniformly distributed across network depth. Instead, unexpectedly, it is primarily localized within the shallow layers, driven by a low-frequency spectral bias and input-insensitive attention patterns. Meanwhile, updates to the deep layers tend to undermine both clean accuracy and robust generalization. Motivated by these insights, we propose Adversarial Robustness Adaptation (R-Adapt), a simple yet effective framework that freezes all pre-trained weights and introduces minimal, insight-driven adaptations only in the initial layers. This design achieves an exceptional balance between adversarial robustness and clean accuracy. R-Adapt further supports training-free, model-guided, and data-driven paradigms, offering flexible pathways to seamlessly equip standard models with robustness. Extensive evaluations on 18 datasets and diverse tasks demonstrate our state-of-the-art performance under various attacks. Notably, R-Adapt generalizes efficiently to large vision-language models (e.g., LLaVA and Qwen-VL) to enhance their robustness. Our project page is available at https://summu77.github.io/R-Adapt.
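A minimal sketch of the freeze-then-adapt recipe, assuming a generic ViT-style encoder whose transformer blocks are exposed as an nn.ModuleList; R-Adapt's actual adaptation modules are richer than plain unfreezing, so treat this as the skeleton only.

```python
import torch.nn as nn

def adapt_shallow_layers(model: nn.Module, blocks: nn.ModuleList, k: int = 2) -> None:
    """Freeze every pre-trained parameter, then re-enable gradients
    only for the first k (shallow) transformer blocks."""
    for p in model.parameters():
        p.requires_grad_(False)          # keep all pre-trained weights intact
    for block in blocks[:k]:             # adapt only the initial layers
        for p in block.parameters():
            p.requires_grad_(True)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable params: {trainable}")
```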
Xiangkui Cao, Jie Zhang, Meina Kan et al. · Institute of Computing Technology · University of Chinese Academy of Sciences
Neuron-level model editing technique that teaches vision-language models to refuse privacy-invasive queries while preserving utility
Large Vision-Language Models (LVLMs) have shown remarkable potential across a wide array of vision-language tasks, leading to their adoption in critical domains such as finance and healthcare. However, their growing deployment also introduces significant security and privacy risks. Malicious actors could potentially exploit these models to extract sensitive information, highlighting a critical vulnerability. Recent studies show that LVLMs often fail to consistently refuse instructions designed to compromise user privacy. While existing work on privacy protection has made meaningful progress in preventing the leakage of sensitive data, it is constrained by limitations in both generalization and non-destructiveness: it often struggles to robustly handle unseen privacy-related queries and may inadvertently degrade a model's performance on standard tasks. To address these challenges, we introduce Neural Gate, a novel method for mitigating privacy risks through neuron-level model editing. Our method improves a model's privacy safeguards by increasing its rate of refusal for privacy-related questions, crucially extending this protective behavior to novel sensitive queries not encountered during the editing process. Neural Gate operates by learning a feature vector to identify neurons associated with privacy-related concepts within the model's representation of a subject. This localization then precisely guides the update of model parameters. Through comprehensive experiments on MiniGPT and LLaVA, we demonstrate that our method significantly boosts the model's privacy protection while preserving its original utility.
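The neuron-localization step can be illustrated with a simple correlation-style score between each hidden unit and a learned concept indicator. This is a hedged approximation of the described mechanism; all names and the top-k selection rule are hypothetical.

```python
import torch

def locate_privacy_neurons(hidden: torch.Tensor, concept: torch.Tensor, k: int = 32) -> torch.Tensor:
    """hidden: (num_samples, num_neurons) activations on subject representations;
    concept: (num_samples,) learned indicator of the privacy-related concept."""
    h = (hidden - hidden.mean(0)) / (hidden.std(0) + 1e-8)
    c = (concept - concept.mean()) / (concept.std() + 1e-8)
    scores = (h * c[:, None]).mean(0).abs()   # per-neuron correlation with the concept
    return scores.topk(k).indices             # candidate neurons to guide the parameter edit
```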
He Zhu, Yanshu Li, Wen Liu et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences
Black-box adversarial text detector using replaced token detection to identify word-substitution attacks with only two model queries
Textual adversarial attacks pose a serious security threat to Natural Language Processing (NLP) systems by introducing imperceptible perturbations that mislead deep learning models. While adversarial example detection offers a lightweight alternative to robust training, existing methods typically rely on prior knowledge of attacks, white-box access to the victim model, or numerous queries, which severely limits their practical deployment. This paper introduces RTD-Guard, a novel black-box framework for detecting textual adversarial examples. Our key insight is that word-substitution perturbations in adversarial attacks closely resemble the "replaced tokens" that a Replaced Token Detection (RTD) discriminator is pre-trained to identify. Leveraging this, RTD-Guard employs an off-the-shelf RTD discriminator, without fine-tuning, to localize suspicious tokens, masks them, and detects adversarial examples by observing the prediction confidence shift of the victim model before and after intervention. The entire process requires no adversarial data, model tuning, or internal model access, and uses only two black-box queries. Comprehensive experiments on multiple benchmark datasets demonstrate that RTD-Guard effectively detects adversarial texts generated by diverse state-of-the-art attack methods. It surpasses existing detection baselines across multiple metrics, offering a highly efficient, practical, and resource-light defense mechanism, particularly suited for real-world deployment in resource-constrained or privacy-sensitive environments.
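Since RTD discriminators such as ELECTRA are publicly available, the token-localization step is easy to sketch. The model name and threshold below are plausible defaults, not necessarily RTD-Guard's choices; the subsequent victim-model confidence comparison (the two black-box queries) is left as a comment.

```python
import torch
from transformers import ElectraTokenizerFast, ElectraForPreTraining

tok = ElectraTokenizerFast.from_pretrained("google/electra-base-discriminator")
disc = ElectraForPreTraining.from_pretrained("google/electra-base-discriminator")

def suspicious_tokens(text: str, threshold: float = 0.0):
    """Return tokens the RTD discriminator scores as 'looks replaced'."""
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        logits = disc(**enc).logits[0]   # per-token score, > 0 means likely replaced
    ids = enc["input_ids"][0]
    return [tok.decode([i]) for i, s in zip(ids, logits) if s > threshold]

# Detection then masks these tokens and compares the victim model's confidence
# on the original vs. masked text: two black-box queries in total.
print(suspicious_tokens("The film was an absolutely marvelous masterwork."))
```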
Yuqi Qian, Yun Cao, Haocheng Fu et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +1 more
Embeds robust provenance watermarks in diffusion model noise using structural encoding to survive lossy post-processing
Diffusion models have made substantial advances in recent years, enabling high-quality image synthesis; however, the widespread dissemination and reuse of their outputs have introduced new challenges in intellectual property protection and content provenance. Image watermarking offers a solution to these challenges, and recent work has increasingly explored Noise-as-Watermark (NaW) approaches that integrate watermarking directly into the diffusion process. However, existing NaW methods fail to balance robustness and diversity. We attribute this weakness to value encoding, which encodes watermark bits into individual sampled values and is therefore extremely fragile in practical application scenarios. To address this, we encode watermark bits into the structured noise pattern, so that the watermark is preserved even when individual values are perturbed. To further ensure generation diversity, we introduce a dedicated randomization design that reshuffles the positions of noise elements without changing their values, preventing the watermark from inducing fixed noise patterns or spatial locations. Extensive experiments demonstrate that our method achieves state-of-the-art robustness while maintaining high generation quality across a wide range of lossy scenarios.
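A toy version of structural encoding with position randomization, under our own assumptions about the construction: bits are tiled as a sign pattern over Gaussian magnitudes, a keyed permutation removes any fixed spatial layout, and extraction takes a majority vote that survives heavy per-value perturbation.

```python
import torch

def embed_structural(bits: torch.Tensor, n: int, key: int) -> torch.Tensor:
    """Tile message bits as a sign pattern over |N(0,1)| magnitudes, then apply a
    keyed permutation so the watermark has no fixed spatial location."""
    g = torch.Generator().manual_seed(key)
    mags = torch.randn(n, generator=g).abs()
    signs = torch.where(bits.repeat(n // len(bits) + 1)[:n] > 0, 1.0, -1.0)
    perm = torch.randperm(n, generator=g)
    return (mags * signs)[perm]

def extract_structural(noisy: torch.Tensor, num_bits: int, key: int) -> torch.Tensor:
    n = len(noisy)
    g = torch.Generator().manual_seed(key)
    _ = torch.randn(n, generator=g)        # replay the RNG so the permutation matches
    perm = torch.randperm(n, generator=g)
    unshuffled = torch.empty_like(noisy)
    unshuffled[perm] = noisy               # undo the keyed reshuffle
    idx = torch.arange(n) % num_bits
    votes = torch.zeros(num_bits).scatter_add_(0, idx, unshuffled.sign())
    return (votes > 0).float()             # majority vote over many positions per bit

bits = torch.randint(0, 2, (64,)).float()
z = embed_structural(bits, 4096, key=7)
z_attacked = z + 0.5 * torch.randn(4096)   # heavy per-value perturbation
print((extract_structural(z_attacked, 64, key=7) == bits).float().mean())  # ~1.0
```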
Feiran Li, Qianqian Xu, Shilong Bao et al. · Institute of Information Engineering · University of Chinese Academy of Sciences +4 more
Black-box backdoor detector for text-to-image diffusion models using semantic instruction-response deviation across varied prompts
This paper investigates the challenging task of detecting backdoored text-to-image models under black-box settings and introduces a novel detection framework BlackMirror. Existing approaches typically rely on analyzing image-level similarity, under the assumption that backdoor-triggered generations exhibit strong consistency across samples. However, they struggle to generalize to recently emerging backdoor attacks, where backdoored generations can appear visually diverse. BlackMirror is motivated by an observation: across backdoor attacks, only partial semantic patterns within the generated image are steadily manipulated, while the rest of the content remains diverse or benign. Accordingly, BlackMirror consists of two components: MirrorMatch, which aligns visual patterns with the corresponding instructions to detect semantic deviations; and MirrorVerify, which evaluates the stability of these deviations across varied prompts to distinguish true backdoor behavior from benign responses. BlackMirror is a general, training-free framework that can be deployed as a plug-and-play module in Model-as-a-Service (MaaS) applications. Comprehensive experiments demonstrate that BlackMirror achieves accurate detection across a wide range of attacks. Code is available at https://github.com/Ferry-Li/BlackMirror.
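The MirrorMatch idea of scoring instruction-response deviation can be approximated with any off-the-shelf image-text alignment model; CLIP below is a stand-in assumption, not necessarily what BlackMirror uses, and the deviation score is illustrative.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def deviation_score(image, prompt: str) -> float:
    """Higher = larger semantic deviation between the generated image and its prompt."""
    inputs = proc(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return 1.0 - (img * txt).sum().item()

# In the MirrorVerify spirit, a deviation that stays high and *stable* across
# many varied prompts is the signature of backdoor behavior, not a one-off miss.
```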
Yifan Zhu, Yibo Miao, Yinpeng Dong et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +2 more
Proposes MI-UE, a theoretically grounded availability-poisoning defense that blocks unauthorized model training by reducing mutual information in poisoned image features
The volume of freely scraped data on the Internet has driven the tremendous success of deep learning. Along with this comes the growing concern about data privacy and security. Numerous methods for generating unlearnable examples have been proposed to prevent data from being illicitly learned by unauthorized deep models by impeding generalization. However, the existing approaches primarily rely on empirical heuristics, making it challenging to enhance unlearnable examples with solid explanations. In this paper, we analyze and improve unlearnable examples from a novel perspective: mutual information reduction. We demonstrate that effective unlearnable examples always decrease mutual information between clean features and poisoned features, and that as the network gets deeper, unlearnability improves together with lower mutual information. Further, we prove from a covariance reduction perspective that minimizing the conditional covariance of intra-class poisoned features reduces the mutual information between distributions. Based on the theoretical results, we propose a novel unlearnable method called Mutual Information Unlearnable Examples (MI-UE) that reduces covariance by maximizing the cosine similarity among intra-class features, thus impeding the generalization effectively. Extensive experiments demonstrate that our approach significantly outperforms the previous methods, even under defense mechanisms.
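A minimal sketch of the covariance-reduction objective as described: maximize pairwise cosine similarity within each class of poisoned features. The loss shape is illustrative, not the authors' exact code.

```python
import torch
import torch.nn.functional as F

def intra_class_cosine_loss(features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Minimize 1 - mean pairwise cosine similarity within each class."""
    loss, count = features.new_zeros(()), 0
    for c in labels.unique():
        f = F.normalize(features[labels == c], dim=1)
        if len(f) < 2:
            continue
        sim = f @ f.t()                             # pairwise cosine similarities
        n = len(f)
        off_diag = (sim.sum() - n) / (n * (n - 1))  # mean excluding the diagonal
        loss = loss + (1 - off_diag)
        count += 1
    return loss / max(count, 1)
```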
Guoqing Ma, Xun Lin, Hui Ma et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +3 more
Steganographic framework hides faces in cover images and detects deepfakes directly in the hidden domain to prevent facial privacy leakage
Most existing Face Forgery Detection (FFD) models assume access to raw face images. In practice, under a client-server framework, private facial data may be intercepted during transmission or leaked by untrusted servers. Previous privacy protection approaches, such as anonymization, encryption, or distortion, partly mitigate leakage but often introduce severe semantic distortion, making images appear obviously protected. This alerts attackers, provoking more aggressive strategies and turning the process into a cat-and-mouse game. Moreover, these methods heavily manipulate image contents, introducing degradation or artifacts that may confuse FFD models, which rely on extremely subtle forgery traces. Inspired by advances in image steganography, which enable high-fidelity hiding and recovery, we propose a Steganography-based Face Forgery Detection framework (StegaFFD) to protect privacy without raising suspicion. StegaFFD hides facial images within natural cover images and directly conducts forgery detection in the steganographic domain. However, the hidden forgery-specific features are extremely subtle and interfered with by cover semantics, posing significant challenges. To address this, we propose Low-Frequency-Aware Decomposition (LFAD) and Spatial-Frequency Differential Attention (SFDA), which suppress interference from low-frequency cover semantics and enhance hidden facial feature perception. Furthermore, we introduce Steganographic Domain Alignment (SDA) to align the representations of hidden faces with those of their raw counterparts, enhancing the model's ability to perceive subtle facial cues in the steganographic domain. Extensive experiments on seven FFD datasets demonstrate that StegaFFD achieves strong imperceptibility, avoids raising attackers' suspicion, and better preserves FFD accuracy compared to existing facial privacy protection methods.
Shuyi Zhou, Zeen Song, Wenwen Qiang et al. · University of Chinese Academy of Sciences · Institute of Information Engineering +1 more
Defends LLMs against adversarial prefix jailbreaks by causal probing to pin malicious intent across autoregressive generation
Large Language Models remain vulnerable to adversarial prefix attacks (e.g., "Sure, here is") despite robust standard safety. We diagnose this vulnerability as Shallow Safety Alignment, stemming from a pathology we term semantic representation decay: as the model generates compliant prefixes, its internal malicious intent signal fades. To address this, we propose Two-Stage Causal-GRPO (TSC-GRPO), a framework designed to achieve intent pinning. First, grounded in causal identifiability theory, we train a causal intent probe to disentangle invariant intent from stylistic perturbations. Second, we internalize this causal awareness into the policy via Group Relative Policy Optimization. By employing a cumulative causal penalty within "fork-in-the-road" training scenarios, we force the model to learn that accumulating harmful tokens monotonically decreases reward, enabling robust late-stage refusals. Experiments show that TSC-GRPO significantly outperforms baselines in defending against jailbreak attacks while preserving general utility.
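The cumulative causal penalty fits in a few lines: per-token probe scores accumulate so that each additional harmful token strictly lowers the episode reward, making a late refusal the reward-optimal move. The probe outputs and the weight lam are placeholders, not the paper's values.

```python
import torch

def shaped_reward(base_reward: float, probe_scores: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    """probe_scores: (seq_len,) causal-probe harmfulness per generated token.
    Reward decreases monotonically as harmful content accumulates."""
    cumulative = torch.cumsum(probe_scores.clamp_min(0), dim=0)
    return base_reward - lam * cumulative[-1]
```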
Yiheng Li, Zichang Tan, Guoqing Xu et al. · University of Chinese Academy of Sciences · Chinese Academy of Sciences +1 more
Benchmarks AI-generated CT image detection with a 10-model dataset and novel wavelet-spatial-frequency CNN detector
With the rapid development of generative AI in medical imaging, synthetic Computed Tomography (CT) images have demonstrated great potential in applications such as data augmentation and clinical diagnosis, but they also introduce serious security risks. Despite the increasing security concerns, existing studies on CT forgery detection are still limited and fail to adequately address real-world challenges. These limitations are mainly reflected in two aspects: the absence of datasets that can effectively evaluate model generalization to reflect real-world application requirements, and the reliance on detection methods designed for natural images that are insensitive to CT-specific forgery artifacts. In light of this, we propose CTForensics, a comprehensive dataset designed to systematically evaluate the generalization capability of CT forgery detection methods, which includes ten diverse CT generative methods. Moreover, we introduce the Enhanced Spatial-Frequency CT Forgery Detector (ESF-CTFD), an efficient CNN-based neural network that captures forgery cues across the wavelet, spatial, and frequency domains. First, it transforms the input CT image into three scales and extracts features at each scale via the Wavelet-Enhanced Central Stem. Then, starting from the largest-scale features, the Spatial Process Block gradually performs feature fusion with the smaller-scale ones. Finally, the Frequency Process Block learns frequency-domain information for predicting the final results. Experiments demonstrate that ESF-CTFD consistently outperforms existing methods and exhibits superior generalization across different CT generative models.
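The wavelet-domain input that the stem builds on can be sketched with PyWavelets; the learned Wavelet-Enhanced Central Stem itself is a CNN component beyond this illustration, and the wavelet choice is an assumption.

```python
import numpy as np
import pywt

def wavelet_subbands(ct_slice: np.ndarray, wavelet: str = "haar") -> np.ndarray:
    """Return approximation + (horizontal, vertical, diagonal) detail subbands,
    stacked as channels for a downstream CNN stem."""
    cA, (cH, cV, cD) = pywt.dwt2(ct_slice, wavelet)
    return np.stack([cA, cH, cV, cD])

x = np.random.rand(512, 512).astype(np.float32)  # stand-in CT slice
print(wavelet_subbands(x).shape)                 # (4, 256, 256)
```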
Dong Yan, Jian Liang, Ran He et al. · University of Chinese Academy of Sciences · Chinese Academy of Sciences +1 more
Defends against LLM attribute inference attacks using fine-grained anonymization and adversarial suffix optimization to induce model rejection
Recent studies have shown that large language models (LLMs) can infer private user attributes (e.g., age, location, gender) from user-generated text shared online, enabling rapid and large-scale privacy breaches. Existing anonymization-based defenses are coarse-grained, lacking word-level precision in anonymizing privacy-leaking elements. Moreover, they are inherently limited as altering user text to hide sensitive cues still allows attribute inference to occur through models' reasoning capabilities. To address these limitations, we propose a unified defense framework that combines fine-grained anonymization (TRACE) with inference-preventing optimization (RPS). TRACE leverages attention mechanisms and inference chain generation to identify and anonymize privacy-leaking textual elements, while RPS employs a lightweight two-stage optimization strategy to induce model rejection behaviors, thereby preventing attribute inference. Evaluations across diverse LLMs show that TRACE-RPS reduces attribute inference accuracy from around 50% to below 5% on open-source models. In addition, our approach offers strong cross-model generalization, prompt-variation robustness, and utility-privacy tradeoffs. Our code is available at https://github.com/Jasper-Yan/TRACE-RPS.
Yuwei Chen, Zhenliang He, Jia Tang et al. · Institute of Computing Technology · University of Chinese Academy of Sciences +1 more
Proposes a one-step diffusion model to extract Gaussian Shading watermarks 20x faster with higher accuracy than multi-step inversion
Watermarking is an important mechanism for provenance and copyright protection of diffusion-generated images. Training-free methods, exemplified by Gaussian Shading, embed watermarks into the initial noise of diffusion models with negligible impact on the quality of generated images. However, extracting this type of watermark typically requires multi-step diffusion inversion to obtain precise initial noise, which is computationally expensive and time-consuming. To address this issue, we propose One-step Inversion (OSI), a significantly faster and more accurate method for extracting Gaussian Shading style watermarks. OSI reformulates watermark extraction as a learnable sign classification problem, which eliminates the need for precise regression of the initial noise. Then, we initialize the OSI model from the diffusion backbone and finetune it on synthesized noise-image pairs with a sign classification objective. In this manner, the OSI model is able to accomplish the watermark extraction efficiently in only one step. Our OSI substantially outperforms the multi-step diffusion inversion method: it is 20x faster, achieves higher extraction accuracy, and doubles the watermark payload capacity. Extensive experiments across diverse schedulers, diffusion backbones, and cryptographic schemes consistently show improvements, demonstrating the generality of our OSI framework.
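The sign-classification reformulation is compact enough to sketch: since Gaussian Shading style watermarks live in the signs of the initial noise, a binary cross-entropy objective over per-element sign predictions replaces precise noise regression, and extraction becomes one forward pass. Names are illustrative.

```python
import torch
import torch.nn.functional as F

def sign_classification_loss(pred_logits: torch.Tensor, init_noise: torch.Tensor) -> torch.Tensor:
    """pred_logits: one logit per initial-noise element from the one-step model;
    init_noise: the true z_T used to generate the image."""
    targets = (init_noise > 0).float()   # the watermark lives in these signs
    return F.binary_cross_entropy_with_logits(pred_logits, targets)

def extract_bits(pred_logits: torch.Tensor) -> torch.Tensor:
    return (pred_logits > 0).long()      # one forward pass, no multi-step inversion
```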
Yu Yan, Sheng Sun, Shengjia Cheng et al. · Institute of Computing Technology · University of Chinese Academy of Sciences +1 more
Jailbreaks VLMs by entangling harmful multi-hop instructions across text and image modalities to evade safety alignment
Vision-Language Models (VLMs) with multimodal reasoning capabilities are high-value attack targets, given their potential for handling complex multimodal harmful tasks. Mainstream black-box jailbreak attacks on VLMs work by distributing malicious clues across modalities to disperse model attention and bypass safety alignment mechanisms. However, these adversarial attacks rely on simple and fixed image-text combinations that lack attack complexity scalability, limiting their effectiveness for red-teaming VLMs' continuously evolving reasoning capabilities. We propose CrossTALK (Cross-modal enTAngLement attacK), a scalable approach that extends and entangles information clues across modalities to exceed VLMs' trained and generalized safety alignment patterns for jailbreak. Specifically, knowledge-scalable reframing extends harmful tasks into multi-hop chain instructions, cross-modal clue entangling migrates visualizable entities into images to build multimodal reasoning links, and cross-modal scenario nesting uses multimodal contextual instructions to steer VLMs toward detailed harmful outputs. Experiments show that CrossTALK achieves a state-of-the-art attack success rate.
Hao Tan, Jun Lan, Senyuan Shi et al. · Institute of Automation · Ant Group +2 more
Detects AI-generated videos using MLLMs enhanced with perception pretext reinforcement learning and a new 3K-video benchmark
The growing capability of video generation poses escalating security risks, making reliable detection increasingly essential. In this paper, we introduce VideoVeritas, a framework that integrates fine-grained perception and fact-based reasoning. We observe that while current multi-modal large language models (MLLMs) exhibit strong reasoning capacity, their granular perception ability remains limited. To mitigate this, we introduce Joint Preference Alignment and Perception Pretext Reinforcement Learning (PPRL). Specifically, rather than directly optimizing for the detection task, we adopt general spatiotemporal grounding and self-supervised object counting in the RL stage, enhancing detection performance with simple perception pretext tasks. To facilitate robust evaluation, we further introduce MintVid, a lightweight yet high-quality dataset containing 3K videos from 9 state-of-the-art generators, along with a real-world collected subset that contains factual errors in content. Experimental results demonstrate that existing methods tend to bias towards either superficial reasoning or mechanical analysis, while VideoVeritas achieves more balanced performance across diverse benchmarks.
Yao Zhou, Zeen Song, Wenwen Qiang et al. · Institute of Software Chinese Academy of Sciences · University of Chinese Academy of Sciences +2 more
Causal front-door adjustment framework strips LLM safety features via Sparse Autoencoders to achieve state-of-the-art jailbreak success rates
Safety alignment mechanisms in Large Language Models (LLMs) often operate as latent internal states, obscuring the model's inherent capabilities. Building on this observation, we model the safety mechanism as an unobserved confounder from a causal perspective. We then propose the Causal Front-Door Adjustment Attack (CFA²) to jailbreak LLMs, a framework that leverages Pearl's Front-Door Criterion to sever the confounding associations for robust jailbreaking. Specifically, we employ Sparse Autoencoders (SAEs) to physically strip defense-related features, isolating the core task intent. We further reduce computationally expensive marginalization to a deterministic intervention with low inference complexity. Experiments demonstrate that CFA² achieves state-of-the-art attack success rates while offering a mechanistic interpretation of the jailbreaking process.
Xi Xuan, Davide Carbone, Ruchi Pandey et al. · University of Eastern Finland · Laboratoire de Physique de l'Ecole Normale Supérieure +2 more
Proposes wavelet scattering transform features for interpretable speech deepfake detection, outperforming SSL front-ends on a challenging benchmark
Designing front-ends for speech deepfake detectors primarily focuses on two categories. Hand-crafted filterbank features are transparent but are limited in capturing high-level semantic details, often resulting in performance gaps compared to self-supervised (SSL) features. SSL features, in turn, lack interpretability and may overlook fine-grained spectral anomalies. We propose the WST-X series, a novel family of feature extractors that combines the best of both worlds via the wavelet scattering transform (WST), integrating wavelets with nonlinearities analogous to deep convolutional networks. We investigate 1D and 2D WSTs to extract acoustic details and higher-order structural anomalies, respectively. Experimental results on the recent and challenging Deepfake-Eval-2024 dataset indicate that WST-X outperforms existing front-ends by a wide margin. Our analysis reveals that a small averaging scale (J), combined with high frequency and directional resolutions (Q, L), is critical for capturing subtle artifacts. This underscores the value of translation-invariant and deformation-stable features for robust and interpretable speech deepfake detection.
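A 1D scattering front-end is available off the shelf in Kymatio (pip install kymatio); the J and Q values below echo the abstract's small-J, high-Q finding but are assumptions, not the reported WST-X configuration.

```python
import torch
from kymatio.torch import Scattering1D

T = 16000                                      # 1 s of 16 kHz audio
scattering = Scattering1D(J=6, shape=T, Q=8)   # small averaging scale, high freq. resolution
x = torch.randn(1, T)                          # stand-in waveform
features = scattering(x)                       # (1, n_coeffs, T / 2**J)
print(features.shape)
```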
Yidan Wang, Yubing Ren, Yanan Cao et al. · Institute of Information Engineering · University of Chinese Academy of Sciences
Proposes WorldCup, a multi-bit LLM output watermarking scheme embedding provenance bits directly into token sampling via hierarchical competition
As large language models (LLMs) generate increasingly human-like text, watermarking offers a promising solution for reliable attribution beyond mere detection. While multi-bit watermarking enables richer provenance encoding, existing methods largely extend zero-bit schemes through seed-driven steering, leading to indirect information flow, limited effective capacity, and suboptimal decoding. In this paper, we propose WorldCup, a multi-bit watermarking framework for LLMs that treats sampling as a natural communication channel and embeds message bits directly into token selection via a hierarchical competition mechanism guided by complementary signals. Moreover, WorldCup further adopts entropy-aware modulation to preserve generation quality and supports robust message recovery through confidence-aware decoding. Comprehensive experiments show that WorldCup achieves a strong balance across capacity, detectability, robustness, text quality, and decoding efficiency, consistently outperforming prior baselines and laying a solid foundation for future LLM watermarking studies.
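A generic version of "sampling as a communication channel" helps situate the contribution: embed each message bit by restricting sampling to a keyed vocabulary partition, skipping low-entropy steps to preserve quality. WorldCup's hierarchical competition and confidence-aware decoding go well beyond this baseline sketch; all names and thresholds here are assumptions.

```python
import torch

def watermarked_sample(logits: torch.Tensor, bit: int, key: int, tau: float = 2.0) -> int:
    """Sample the next token while embedding one message bit via a keyed vocab split."""
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
    if entropy < tau:                     # entropy-aware modulation: skip low-entropy steps
        return int(probs.argmax())
    g = torch.Generator().manual_seed(key)
    partition = torch.randint(0, 2, (len(logits),), generator=g)   # keyed vocabulary split
    masked = logits.masked_fill(partition != bit, float("-inf"))   # embed the bit directly
    return int(torch.multinomial(torch.softmax(masked, -1), 1))
```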