ML Security Papers

Latest papers

64 papers

attack arXiv Apr 23, 2026 · 28d ago

Toward Efficient Membership Inference Attacks against Federated Large Language Models: A Projection Residual Approach

Guilin Deng, Silong Chen, Yuchuan Luo et al. · National University of Defense Technology · City University of Hong Kong +1 more

Gradient-based membership inference attack on federated LLMs achieving near-perfect accuracy via projection residual analysis

Membership Inference Attack nlpfederated-learning

PDF Code

attack arXiv Apr 18, 2026 · 4w ago

Visual Inception: Compromising Long-term Planning in Agentic Recommenders via Multimodal Memory Poisoning

Jiachen Qian · City University of Hong Kong

Multimodal memory poisoning attack that embeds visual triggers in images to hijack AI agent planning, plus dual-process defense

Input Manipulation Attack Data Poisoning Attack Prompt Injection Excessive Agency multimodalnlp

PDF

defense arXiv Apr 18, 2026 · 4w ago

CapSeal: Capability-Sealed Secret Mediation for Secure Agent Execution

Shutong Jin, Ruiyi Guo, Ray C. C. Cheung · City University of Hong Kong · Beijing Foreign Studies University

Broker-mediated capability system that prevents AI agents from directly accessing secrets, defending against prompt injection exfiltration

Prompt Injection Insecure Plugin Design nlp

PDF

attack arXiv Apr 16, 2026 · 5w ago

Physically-Induced Atmospheric Adversarial Perturbations: Enhancing Transferability and Robustness in Remote Sensing Image Classification

Weiwei Zhuang, Wangze Xie, Qi Zhang et al. · Xiamen University of Technology · City University of Macau +8 more

Generates physically plausible fog-based adversarial perturbations for remote sensing classifiers with high transferability and defense robustness

Input Manipulation Attack vision

PDF

defense arXiv Apr 1, 2026 · 7w ago

PDA: Text-Augmented Defense Framework for Robust Vision-Language Models against Adversarial Image Attacks

Jingning Xu, Haochen Luo, Chen Liu · City University of Hong Kong

Training-free defense using text augmentation to protect VLMs against diverse adversarial image perturbations at inference time

Input Manipulation Attack multimodalvisionnlp

PDF

attack arXiv Mar 25, 2026 · 8w ago

How Vulnerable Are Edge LLMs?

Ao Ding, Hongzong Li, Zi Liang et al. · China University of Geosciences · Hong Kong University of Science and Technology +4 more

Query-based extraction attack on quantized edge LLMs using clustered instruction queries to steal model behavior efficiently

Model Theft Model Theft nlp

PDF

defense arXiv Mar 23, 2026 · 8w ago

Disentangling Speaker Traits for Deepfake Source Verification via Chebyshev Polynomial and Riemannian Metric Learning

Xi Xuan, Wenxin Zhang, Zhiyu Li et al. · University of Eastern Finland · City University of Hong Kong +3 more

Disentangles speaker traits from deepfake source embeddings using Chebyshev polynomials and Riemannian geometry for robust generator verification

Output Integrity Attack audiogenerative

PDF Code

attack arXiv Mar 18, 2026 · 9w ago

TINA: Text-Free Inversion Attack for Unlearned Text-to-Image Diffusion Models

Qianlong Xiang, Miao Zhang, Haoyu Zhang et al. · Harbin Institute of Technology · City University of Hong Kong +3 more

Text-free inversion attack that recovers supposedly erased concepts from diffusion models by exploiting persistent visual knowledge

Model Inversion Attack visiongenerative

PDF

attack arXiv Mar 18, 2026 · 9w ago

ARES: Scalable and Practical Gradient Inversion Attack in Federated Learning through Activation Recovery

Zirui Gong, Leo Yu Zhang, Yanjun Zhang et al. · Griffith University · Swinburne University of Technology +2 more

Gradient inversion attack reconstructing training data from federated learning updates via sparse activation recovery without architectural changes

Model Inversion Attack visionfederated-learning

PDF

benchmark arXiv Mar 11, 2026 · 10w ago

Probabilistic Verification of Voice Anti-Spoofing Models

Evgeny Kushnir, Alexandr Kozodaev, Dmitrii Korzh et al. · AXXX · HSE +5 more

Proposes PV-VASM, a black-box probabilistic framework that formally bounds misclassification risk of speech deepfake detectors against TTS and voice cloning attacks

Output Integrity Attack audio

PDF

defense arXiv Mar 11, 2026 · 10w ago

AttriGuard: Defeating Indirect Prompt Injection in LLM Agents via Causal Attribution of Tool Invocations

Yu He, Haozhe Zhu, Yiming Li et al. · Zhejiang University · Nanyang Technological University +1 more

Runtime defense for LLM agents detecting indirect prompt injection via causal counterfactual analysis of tool invocations

Prompt Injection nlp

PDF Code

benchmark arXiv Mar 8, 2026 · 10w ago

Give Them an Inch and They Will Take a Mile:Understanding and Measuring Caller Identity Confusion in MCP-Based AI Systems

Yuhang Huang, Boyang Ma, Biwei Yan et al. · Shandong University · City University of Hong Kong

Large-scale empirical analysis reveals MCP servers fail to authenticate callers, enabling unauthorized tool access in LLM agent systems

Insecure Plugin Design nlp

PDF

defense arXiv Feb 24, 2026 · 12w ago

RecoverMark: Robust Watermarking for Localization and Recovery of Manipulated Faces

Haonan An, Xiaohui Ye, Guang Hua et al. · South China University of Technology · Singapore Institute of Technology +1 more

Embeds face content as background watermark to robustly detect, localize, and recover manipulated face regions against removal attacks

Output Integrity Attack visiongenerative

PDF

The proliferation of AI-generated content has facilitated sophisticated face manipulation, severely undermining visual integrity and posing unprecedented challenges to intellectual property. In response, a common proactive defense leverages fragile watermarks to detect, localize, or even recover manipulated regions. However, these methods always assume an adversary unaware of the embedded watermark, overlooking their inherent vulnerability to watermark removal attacks. Furthermore, this fragility is exacerbated in the commonly used dual-watermark strategy that adds a robust watermark for image ownership verification, where mutual interference and limited embedding capacity reduce the fragile watermark's effectiveness. To address the gap, we propose RecoverMark, a watermarking framework that achieves robust manipulation localization, content recovery, and ownership verification simultaneously. Our key insight is twofold. First, we exploit a critical real-world constraint: an adversary must preserve the background's semantic consistency to avoid visual detection, even if they apply global, imperceptible watermark removal attacks. Second, using the image's own content (face, in this paper) as the watermark enhances extraction robustness. Based on these insights, RecoverMark treats the protected face content itself as the watermark and embeds it into the surrounding background. By designing a robust two-stage training paradigm with carefully crafted distortion layers that simulate comprehensive potential attacks and a progressive training strategy, RecoverMark achieves a robust watermark embedding in no fragile manner for image manipulation localization, recovery, and image IP protection simultaneously. Extensive experiments demonstrate the proposed RecoverMark's robustness against both seen and unseen attacks and its generalizability to in-distribution and out-of-distribution data.

gan diffusion cnn South China University of Technology · Singapore Institute of Technology · City University of Hong Kong

PDF arXiv DOI

attack arXiv Feb 24, 2026 · 12w ago

OptiLeak: Efficient Prompt Reconstruction via Reinforcement Learning in Multi-tenant LLM Services

Longxiang Wang, Xiang Zheng, Xuhao Zhang et al. · City University of Hong Kong · ByteDance

Attacks multi-tenant LLM services via KV cache side-channels to reconstruct private prompts with 12× efficiency gains

Sensitive Information Disclosure nlp

PDF

attack arXiv Feb 23, 2026 · 12w ago

PA-Attack: Guiding Gray-Box Attacks on LVLM Vision Encoders with Prototypes and Attention

Hefei Mei, Zirui Wang, Chang Xu et al. · City University of Hong Kong · The University of Sydney

Gray-box adversarial attack on LVLM vision encoders using prototype anchoring and attention-guided perturbations, achieving 75.1% score reduction

Input Manipulation Attack Prompt Injection visionmultimodalnlp

PDF Code

defense arXiv Feb 11, 2026 · Feb 2026

Mitigating Gradient Inversion Risks in Language Models via Token Obfuscation

Xinguo Feng, Zhongkui Ma, Zihan Wang et al. · The University of Queensland · CSIRO’s Data61 +1 more

Defends collaborative LLM training against gradient inversion by replacing tokens with semantically disconnected yet embedding-proximate shadow substitutes

Model Inversion Attack Sensitive Information Disclosure nlpfederated-learning

PDF

attack arXiv Feb 6, 2026 · Feb 2026

Confundo: Learning to Generate Robust Poison for Practical RAG Systems

Haoyang Hu, Zhejun Jiang, Yueming Lyu et al. · The University of Hong Kong · Nanjing University +1 more

Fine-tunes an LLM as a poison generator to inject robust, chunking-aware malicious content into RAG knowledge bases

Data Poisoning Attack Prompt Injection nlp

PDF

defense arXiv Feb 4, 2026 · Feb 2026

SIDeR: Semantic Identity Decoupling for Unrestricted Face Privacy

Zhuosen Bao, Xia Du, Zheng Lin et al. · Xiamen University of Technology · University of Hong Kong +8 more

Generates unrestricted adversarial faces using diffusion models to evade facial recognition with 99% black-box success rate

Input Manipulation Attack visiongenerative

PDF

defense arXiv Jan 30, 2026 · Jan 2026

Color Matters: Demosaicing-Guided Color Correlation Training for Generalizable AI-Generated Image Detection

Nan Zhong, Yiran Xu, Mian Zou · City University of Hong Kong · Fudan University +1 more

Detects AI-generated images via camera CFA color correlations, achieving state-of-the-art generalization across 20+ unseen generators

Output Integrity Attack vision

PDF

attack arXiv Jan 29, 2026 · Jan 2026

Just Ask: Curious Code Agents Reveal System Prompts in Frontier LLMs

Xiang Zheng, Yutao Wu, Hanxun Huang et al. · City University of Hong Kong · Deakin University +4 more

Self-evolving agent framework extracts hidden system prompts from 41 commercial LLMs using UCB-guided natural language probing strategies

Sensitive Information Disclosure Prompt Injection nlp

PDF

Loading more papers…

Latest papers

Toward Efficient Membership Inference Attacks against Federated Large Language Models: A Projection Residual Approach

Visual Inception: Compromising Long-term Planning in Agentic Recommenders via Multimodal Memory Poisoning

CapSeal: Capability-Sealed Secret Mediation for Secure Agent Execution

Physically-Induced Atmospheric Adversarial Perturbations: Enhancing Transferability and Robustness in Remote Sensing Image Classification

PDA: Text-Augmented Defense Framework for Robust Vision-Language Models against Adversarial Image Attacks

How Vulnerable Are Edge LLMs?

Disentangling Speaker Traits for Deepfake Source Verification via Chebyshev Polynomial and Riemannian Metric Learning

TINA: Text-Free Inversion Attack for Unlearned Text-to-Image Diffusion Models

ARES: Scalable and Practical Gradient Inversion Attack in Federated Learning through Activation Recovery

Probabilistic Verification of Voice Anti-Spoofing Models

AttriGuard: Defeating Indirect Prompt Injection in LLM Agents via Causal Attribution of Tool Invocations

Give Them an Inch and They Will Take a Mile:Understanding and Measuring Caller Identity Confusion in MCP-Based AI Systems

RecoverMark: Robust Watermarking for Localization and Recovery of Manipulated Faces

OptiLeak: Efficient Prompt Reconstruction via Reinforcement Learning in Multi-tenant LLM Services

PA-Attack: Guiding Gray-Box Attacks on LVLM Vision Encoders with Prototypes and Attention

Mitigating Gradient Inversion Risks in Language Models via Token Obfuscation

Confundo: Learning to Generate Robust Poison for Practical RAG Systems

SIDeR: Semantic Identity Decoupling for Unrestricted Face Privacy

Color Matters: Demosaicing-Guided Color Correlation Training for Generalizable AI-Generated Image Detection

Just Ask: Curious Code Agents Reveal System Prompts in Frontier LLMs

Filters

Time Period

Paper Type

OWASP ML Top 10

OWASP LLM Top 10

Institution

Venue