Latest papers

93 papers
defense arXiv Mar 23, 2026 · 16d ago

Disentangling Speaker Traits for Deepfake Source Verification via Chebyshev Polynomial and Riemannian Metric Learning

Xi Xuan, Wenxin Zhang, Zhiyu Li et al. · University of Eastern Finland · City University of Hong Kong +3 more

Disentangles speaker traits from deepfake source embeddings using Chebyshev polynomials and Riemannian geometry for robust generator verification

Output Integrity Attack audio generative
PDF Code
survey arXiv Mar 23, 2026 · 16d ago

Towards Secure Retrieval-Augmented Generation: A Comprehensive Review of Threats, Defenses and Benchmarks

Yanming Mu, Hao Hu, Feiyang Li et al. · State Key Laboratory of Mathematical Engineering and Advanced Computing · Information Engineering University +2 more

First end-to-end survey mapping RAG security threats, defenses, and benchmarks across the entire pipeline

Prompt Injection Training Data Poisoning Sensitive Information Disclosure nlp
PDF
attack arXiv Mar 20, 2026 · 19d ago

CAMA: Exploring Collusive Adversarial Attacks in c-MARL

Men Niu, Xinxin Fan, Quanliang Jing et al. · Institute of Computing Technology · University of Chinese Academy of Sciences +1 more

Introduces three collusive policy-level attacks on cooperative MARL where multiple malicious agents coordinate to disrupt teamwork

Input Manipulation Attack reinforcement-learning
PDF
defense arXiv Mar 16, 2026 · 23d ago

Rethinking LLM Watermark Detection in Black-Box Settings: A Non-Intrusive Third-Party Framework

Zhuoshang Wang, Yubing Ren, Yanan Cao et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +1 more

Black-box framework for third-party watermark detection in LLM outputs using proxy models and statistical tests

Output Integrity Attack nlp
PDF
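The summary above mentions detecting LLM watermarks with statistical tests. As a generic illustration only (not this paper's framework), the widely used "green-list" scheme can be scored with a z-test: each position's green set is keyed on the previous token, and watermarked text over-represents green tokens. All names here (`green_fraction_z`, the hashing scheme, the seed) are hypothetical.

```python
import hashlib
import math

def green_fraction_z(tokens, green_ratio=0.5, seed=42):
    """Score a token-id sequence for a green-list watermark.

    Each position's green list is derived by hashing the previous
    token with a shared seed; watermarked text over-represents green
    tokens, so the z-score grows roughly with sqrt(length).
    """
    hits = 0
    for prev, cur in zip(tokens, tokens[1:]):
        h = hashlib.sha256(f"{seed}:{prev}".encode()).digest()
        # A token is 'green' if its hashed rank falls in the green fraction.
        rank = int.from_bytes(
            hashlib.sha256(h + cur.to_bytes(4, "big")).digest()[:4], "big"
        )
        if rank / 2**32 < green_ratio:
            hits += 1
    n = len(tokens) - 1
    expected = green_ratio * n
    std = math.sqrt(n * green_ratio * (1 - green_ratio))
    return (hits - expected) / std if std else 0.0
```

A large positive z-score indicates watermarked text; unwatermarked text hovers near zero.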
defense arXiv Mar 13, 2026 · 26d ago

What Makes VLMs Robust? Towards Reconciling Robustness and Accuracy in Vision-Language Models

Sen Nie, Jie Zhang, Zhongqi Wang et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences

Freezes pre-trained VLM weights and adapts only shallow layers to achieve adversarial robustness without sacrificing clean accuracy

Input Manipulation Attack vision nlp multimodal
PDF Code
defense arXiv Mar 13, 2026 · 26d ago

Neural Gate: Mitigating Privacy Risks in LVLMs via Neuron-Level Gradient Gating

Xiangkui Cao, Jie Zhang, Meina Kan et al. · Institute of Computing Technology · University of Chinese Academy of Sciences

Neuron-level model editing technique that teaches vision-language models to refuse privacy-invasive queries while preserving utility

Sensitive Information Disclosure Prompt Injection multimodal nlp vision
PDF
defense arXiv Mar 13, 2026 · 26d ago

RTD-Guard: A Black-Box Textual Adversarial Detection Framework via Replacement Token Detection

He Zhu, Yanshu Li, Wen Liu et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences

Black-box adversarial text detector using replaced token detection to identify word-substitution attacks with only two model queries

Input Manipulation Attack nlp
PDF
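The two-query protocol described above can be sketched generically: classify the input once, classify a token-repaired copy once, and flag the input if the prediction shifts sharply. This is a minimal illustration of the idea, not RTD-Guard itself; `classify` and `repair` are hypothetical stand-ins for the victim model and a replaced-token-detection repair module.

```python
def detect_substitution_attack(text, classify, repair, margin=0.3):
    """Flag a possible word-substitution attack with two classifier queries.

    A repaired copy of the text is classified alongside the original;
    a large prediction shift suggests the original contained
    adversarial substitutions.
    """
    p_orig = classify(text)          # query 1: original input
    p_repaired = classify(repair(text))  # query 2: repaired input
    shift = max(abs(a - b) for a, b in zip(p_orig, p_repaired))
    return shift > margin
```

The design keeps the victim model black-box: only its output probabilities on two inputs are needed.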
defense arXiv Mar 10, 2026 · 29d ago

ShapeMark: Robust and Diversity-Preserving Watermarking for Diffusion Models

Yuqi Qian, Yun Cao, Haocheng Fu et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +1 more

Embeds robust provenance watermarks in diffusion model noise using structural encoding to survive lossy post-processing

Output Integrity Attack vision generative
PDF
defense arXiv Mar 6, 2026 · 4w ago

BlackMirror: Black-Box Backdoor Detection for Text-to-Image Models via Instruction-Response Deviation

Feiran Li, Qianqian Xu, Shilong Bao et al. · Institute of Information Engineering · University of Chinese Academy of Sciences +4 more

Black-box backdoor detector for text-to-image diffusion models using semantic instruction-response deviation across varied prompts

Model Poisoning vision generative multimodal
PDF Code
defense arXiv Mar 4, 2026 · 5w ago

Why Do Unlearnable Examples Work: A Novel Perspective of Mutual Information

Yifan Zhu, Yibo Miao, Yinpeng Dong et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +2 more

Proposes MI-UE, a theoretically grounded availability-poisoning defense that blocks unauthorized model training by reducing mutual information in poisoned image features

Data Poisoning Attack vision
PDF
defense arXiv Mar 3, 2026 · 5w ago

StegaFFD: Privacy-Preserving Face Forgery Detection via Fine-Grained Steganographic Domain Lifting

Guoqing Ma, Xun Lin, Hui Ma et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +3 more

Steganographic framework hides faces in cover images and detects deepfakes directly in the hidden domain to prevent facial privacy leakage

Output Integrity Attack vision
PDF
defense arXiv Mar 3, 2026 · 5w ago

From Shallow to Deep: Pinning Semantic Intent via Causal GRPO

Shuyi Zhou, Zeen Song, Wenwen Qiang et al. · University of Chinese Academy of Sciences · Institute of Information Engineering +1 more

Defends LLMs against adversarial-prefix jailbreaks via causal probing that pins down malicious intent across autoregressive generation

Prompt Injection nlp
PDF
benchmark arXiv Mar 2, 2026 · 5w ago

CTForensics: A Comprehensive Dataset and Method for AI-Generated CT Image Detection

Yiheng Li, Zichang Tan, Guoqing Xu et al. · University of Chinese Academy of Sciences · Chinese Academy of Sciences +1 more

Benchmarks AI-generated CT image detection with a 10-model dataset and novel wavelet-spatial-frequency CNN detector

Output Integrity Attack vision
PDF Code
defense arXiv Feb 12, 2026 · 7w ago

Stop Tracking Me! Proactive Defense Against Attribute Inference Attack in LLMs

Dong Yan, Jian Liang, Ran He et al. · University of Chinese Academy of Sciences · Chinese Academy of Sciences +1 more

Defends against LLM attribute inference attacks using fine-grained anonymization and adversarial suffix optimization to induce model rejection

Sensitive Information Disclosure nlp
1 citation PDF Code
defense arXiv Feb 10, 2026 · 8w ago

OSI: One-step Inversion Excels in Extracting Diffusion Watermarks

Yuwei Chen, Zhenliang He, Jia Tang et al. · Institute of Computing Technology · University of Chinese Academy of Sciences +1 more

Proposes a one-step diffusion model to extract Gaussian Shading watermarks 20x faster with higher accuracy than multi-step inversion

Output Integrity Attack generative
PDF
attack arXiv Feb 9, 2026 · 8w ago

Red-teaming the Multimodal Reasoning: Jailbreaking Vision-Language Models via Cross-modal Entanglement Attacks

Yu Yan, Sheng Sun, Shengjia Cheng et al. · Institute of Computing Technology · University of Chinese Academy of Sciences +1 more

Jailbreaks VLMs by entangling harmful multi-hop instructions across text and image modalities to evade safety alignment

Prompt Injection multimodal vision nlp
PDF
tool arXiv Feb 9, 2026 · 8w ago

VideoVeritas: AI-Generated Video Detection via Perception Pretext Reinforcement Learning

Hao Tan, Jun Lan, Senyuan Shi et al. · Institute of Automation · Ant Group +2 more

Detects AI-generated videos using MLLMs enhanced with perception pretext reinforcement learning and a new 3K-video benchmark

Output Integrity Attack vision multimodal nlp
PDF Code
attack arXiv Feb 5, 2026 · 8w ago

Causal Front-Door Adjustment for Robust Jailbreak Attacks on LLMs

Yao Zhou, Zeen Song, Wenwen Qiang et al. · Institute of Software Chinese Academy of Sciences · University of Chinese Academy of Sciences +2 more

Causal front-door adjustment framework strips LLM safety features via Sparse Autoencoders to achieve state-of-the-art jailbreak success rates

Prompt Injection nlp
PDF
defense arXiv Feb 3, 2026 · 9w ago

WST-X Series: Wavelet Scattering Transform for Interpretable Speech Deepfake Detection

Xi Xuan, Davide Carbone, Ruchi Pandey et al. · University of Eastern Finland · Laboratoire de Physique de l'Ecole Normale Supérieure +2 more

Proposes wavelet scattering transform features for interpretable speech deepfake detection, outperforming SSL front-ends on a challenging benchmark

Output Integrity Attack audio
PDF
defense arXiv Feb 2, 2026 · 9w ago

WorldCup Sampling for Multi-bit LLM Watermarking

Yidan Wang, Yubing Ren, Yanan Cao et al. · Institute of Information Engineering · University of Chinese Academy of Sciences

Proposes WorldCup, a multi-bit LLM output watermarking scheme embedding provenance bits directly into token sampling via hierarchical competition

Output Integrity Attack nlp
PDF
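The idea of embedding provenance bits directly into token sampling can be illustrated generically: key a binary partition of the vocabulary on the previous token, then steer each sampling step to the half matching the next payload bit, so the bits are recoverable from the text alone. This is a toy sketch of multi-bit sampling watermarks in general, not WorldCup's hierarchical-competition scheme; all function names are hypothetical.

```python
import hashlib

def _partition(prev_token, seed=7):
    """Deterministically assign each token to one of two halves, keyed on context."""
    def side(tok):
        h = hashlib.sha256(f"{seed}:{prev_token}:{tok}".encode()).digest()
        return h[0] & 1
    return side

def embed_bits(prompt_token, bits, candidates):
    """At each step, emit a candidate token whose partition side equals the payload bit."""
    out = [prompt_token]
    for bit in bits:
        side = _partition(out[-1])
        # A real sampler would restrict the model's top-k list to this half;
        # here we just take the first matching candidate.
        tok = next(c for c in candidates if side(c) == bit)
        out.append(tok)
    return out

def extract_bits(tokens):
    """Recover the embedded bits from the token sequence alone."""
    return [_partition(prev)(cur) for prev, cur in zip(tokens, tokens[1:])]
```

Because the partition depends only on the previous token and a shared seed, extraction needs no model access.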