Latest papers

60 papers
defense arXiv Apr 1, 2026 · 5d ago

PDA: Text-Augmented Defense Framework for Robust Vision-Language Models against Adversarial Image Attacks

Jingning Xu, Haochen Luo, Chen Liu · City University of Hong Kong

Training-free defense using text augmentation to protect VLMs against diverse adversarial image perturbations at inference time

Input Manipulation Attack multimodalvisionnlp
PDF
attack arXiv Mar 25, 2026 · 12d ago

How Vulnerable Are Edge LLMs?

Ao Ding, Hongzong Li, Zi Liang et al. · China University of Geosciences · Hong Kong University of Science and Technology +4 more

Query-based extraction attack on quantized edge LLMs using clustered instruction queries to steal model behavior efficiently

Model Theft Model Theft nlp
PDF
defense arXiv Mar 23, 2026 · 14d ago

Disentangling Speaker Traits for Deepfake Source Verification via Chebyshev Polynomial and Riemannian Metric Learning

Xi Xuan, Wenxin Zhang, Zhiyu Li et al. · University of Eastern Finland · City University of Hong Kong +3 more

Disentangles speaker traits from deepfake source embeddings using Chebyshev polynomials and Riemannian geometry for robust generator verification

Output Integrity Attack audiogenerative
PDF Code
attack arXiv Mar 18, 2026 · 19d ago

TINA: Text-Free Inversion Attack for Unlearned Text-to-Image Diffusion Models

Qianlong Xiang, Miao Zhang, Haoyu Zhang et al. · Harbin Institute of Technology · City University of Hong Kong +3 more

Text-free inversion attack that recovers supposedly erased concepts from diffusion models by exploiting persistent visual knowledge

Model Inversion Attack visiongenerative
PDF
attack arXiv Mar 18, 2026 · 19d ago

ARES: Scalable and Practical Gradient Inversion Attack in Federated Learning through Activation Recovery

Zirui Gong, Leo Yu Zhang, Yanjun Zhang et al. · Griffith University · Swinburne University of Technology +2 more

Gradient inversion attack reconstructing training data from federated learning updates via sparse activation recovery without architectural changes

Model Inversion Attack visionfederated-learning
PDF
defense arXiv Mar 11, 2026 · 26d ago

AttriGuard: Defeating Indirect Prompt Injection in LLM Agents via Causal Attribution of Tool Invocations

Yu He, Haozhe Zhu, Yiming Li et al. · Zhejiang University · Nanyang Technological University +1 more

Runtime defense for LLM agents detecting indirect prompt injection via causal counterfactual analysis of tool invocations

Prompt Injection nlp
PDF Code
benchmark arXiv Mar 11, 2026 · 26d ago

Probabilistic Verification of Voice Anti-Spoofing Models

Evgeny Kushnir, Alexandr Kozodaev, Dmitrii Korzh et al. · AXXX · HSE +5 more

Proposes PV-VASM, a black-box probabilistic framework that formally bounds misclassification risk of speech deepfake detectors against TTS and voice cloning attacks

Output Integrity Attack audio
PDF
benchmark arXiv Mar 8, 2026 · 29d ago

Give Them an Inch and They Will Take a Mile:Understanding and Measuring Caller Identity Confusion in MCP-Based AI Systems

Yuhang Huang, Boyang Ma, Biwei Yan et al. · Shandong University · City University of Hong Kong

Large-scale empirical analysis reveals MCP servers fail to authenticate callers, enabling unauthorized tool access in LLM agent systems

Insecure Plugin Design nlp
PDF
attack arXiv Feb 24, 2026 · 5w ago

OptiLeak: Efficient Prompt Reconstruction via Reinforcement Learning in Multi-tenant LLM Services

Longxiang Wang, Xiang Zheng, Xuhao Zhang et al. · City University of Hong Kong · ByteDance

Attacks multi-tenant LLM services via KV cache side-channels to reconstruct private prompts with 12× efficiency gains

Sensitive Information Disclosure nlp
PDF
defense arXiv Feb 24, 2026 · 5w ago

RecoverMark: Robust Watermarking for Localization and Recovery of Manipulated Faces

Haonan An, Xiaohui Ye, Guang Hua et al. · South China University of Technology · Singapore Institute of Technology +1 more

Embeds face content as background watermark to robustly detect, localize, and recover manipulated face regions against removal attacks

Output Integrity Attack visiongenerative
PDF
attack arXiv Feb 23, 2026 · 6w ago

PA-Attack: Guiding Gray-Box Attacks on LVLM Vision Encoders with Prototypes and Attention

Hefei Mei, Zirui Wang, Chang Xu et al. · City University of Hong Kong · The University of Sydney

Gray-box adversarial attack on LVLM vision encoders using prototype anchoring and attention-guided perturbations, achieving 75.1% score reduction

Input Manipulation Attack Prompt Injection visionmultimodalnlp
PDF Code
defense arXiv Feb 11, 2026 · 7w ago

Mitigating Gradient Inversion Risks in Language Models via Token Obfuscation

Xinguo Feng, Zhongkui Ma, Zihan Wang et al. · The University of Queensland · CSIRO’s Data61 +1 more

Defends collaborative LLM training against gradient inversion by replacing tokens with semantically disconnected yet embedding-proximate shadow substitutes

Model Inversion Attack Sensitive Information Disclosure nlpfederated-learning
PDF
attack arXiv Feb 6, 2026 · 8w ago

Confundo: Learning to Generate Robust Poison for Practical RAG Systems

Haoyang Hu, Zhejun Jiang, Yueming Lyu et al. · The University of Hong Kong · Nanjing University +1 more

Fine-tunes an LLM as a poison generator to inject robust, chunking-aware malicious content into RAG knowledge bases

Data Poisoning Attack Prompt Injection nlp
PDF
defense arXiv Feb 4, 2026 · 8w ago

SIDeR: Semantic Identity Decoupling for Unrestricted Face Privacy

Zhuosen Bao, Xia Du, Zheng Lin et al. · Xiamen University of Technology · University of Hong Kong +8 more

Generates unrestricted adversarial faces using diffusion models to evade facial recognition with 99% black-box success rate

Input Manipulation Attack visiongenerative
PDF
defense arXiv Jan 30, 2026 · 9w ago

Color Matters: Demosaicing-Guided Color Correlation Training for Generalizable AI-Generated Image Detection

Nan Zhong, Yiran Xu, Mian Zou · City University of Hong Kong · Fudan University +1 more

Detects AI-generated images via camera CFA color correlations, achieving state-of-the-art generalization across 20+ unseen generators

Output Integrity Attack vision
PDF
attack arXiv Jan 29, 2026 · 9w ago

Just Ask: Curious Code Agents Reveal System Prompts in Frontier LLMs

Xiang Zheng, Yutao Wu, Hanxun Huang et al. · City University of Hong Kong · Deakin University +4 more

Self-evolving agent framework extracts hidden system prompts from 41 commercial LLMs using UCB-guided natural language probing strategies

Sensitive Information Disclosure Prompt Injection nlp
PDF
attack arXiv Jan 24, 2026 · 10w ago

Reconstructing Training Data from Adapter-based Federated Large Language Models

Silong Chen, Yuchuan Luo, Guilin Deng et al. · National University of Defense Technology · City University of Hong Kong

Gradient inversion attack reconstructs training text from LoRA adapter gradients in federated LLMs achieving ROUGE-1/2 over 99

Model Inversion Attack Sensitive Information Disclosure nlpfederated-learning
PDF Code
defense arXiv Jan 20, 2026 · 10w ago

SecureSplit: Mitigating Backdoor Attacks in Split Learning

Zhihao Dou, Dongfei Cui, Weida Wang et al. · Case Western Reserve University · Northeast Electric Power University +6 more

Defends split learning against backdoor attacks by transforming embeddings and filtering poisoned ones via majority-voting scheme

Model Poisoning visionfederated-learning
PDF
defense TDSC Jan 17, 2026 · 11w ago

Decoder Gradient Shields: A Family of Provable and High-Fidelity Methods Against Gradient-Based Box-Free Watermark Removal

Haonan An, Guang Hua, Wei Du et al. · City University of Hong Kong · Singapore Institute of Technology +3 more

Defends box-free model watermarks in generative model outputs against gradient-leakage-based removal attacks using provable gradient-manipulation shields

Output Integrity Attack visiongenerative
1 citations PDF
benchmark arXiv Jan 6, 2026 · Jan 2026

The Anatomy of Conversational Scams: A Topic-Based Red Teaming Analysis of Multi-Turn Interactions in LLMs

Xiangzhe Yuan, Zhenhao Zhang, Haoming Tang et al. · University of Iowa · City University of Hong Kong

Red-teams eight LLMs as conversational scam attackers and victims across 18,648 multi-turn dialogues to map safety failure modes

Prompt Injection nlp
PDF
Loading more papers…