ML Security Papers

Latest papers

16 papers

defense arXiv Apr 29, 2026 · 22d ago

Robust Alignment: Harmonizing Clean Accuracy and Adversarial Robustness in Adversarial Training

Yanyun Wang, Qingqing Ye, Li Liu et al. · Hong Kong Polytechnic University · Hong Kong University of Science and Technology

Adversarial training method that harmonizes clean accuracy and robustness by aligning input perturbations with latent space representations

Input Manipulation Attack vision

PDF

attack arXiv Apr 25, 2026 · 26d ago

Toward Polymorphic Backdoor against Semantic Communication via Intensity-Based Poisoning

Xiao Yang, Yuni Lai, Gaolei Li et al. · Shanghai Jiao Tong University · Hong Kong Polytechnic University +1 more

Polymorphic backdoor attack on semantic communication systems using intensity-graded triggers for multiple target outputs plus provable defense

Model Poisoning Data Poisoning Attack vision multimodal

PDF

attack arXiv Mar 25, 2026 · 8w ago

How Vulnerable Are Edge LLMs?

Ao Ding, Hongzong Li, Zi Liang et al. · China University of Geosciences · Hong Kong University of Science and Technology +4 more

Query-based extraction attack on quantized edge LLMs using clustered instruction queries to steal model behavior efficiently

Model Theft Model Theft nlp

PDF

defense arXiv Jan 30, 2026 · Jan 2026

DNA: Uncovering Universal Latent Forgery Knowledge

Jingtong Dou, Chuancheng Shi, Yemin Wang et al. · The University of Sydney · Xiamen University +2 more

Probes latent neurons in pre-trained vision models to detect AI-generated images without costly fine-tuning, outperforming black-box baselines

Output Integrity Attack vision

PDF

attack arXiv Jan 14, 2026 · Jan 2026

SpatialJB: How Text Distribution Art Becomes the "Jailbreak Key" for LLM Guardrails

Zhiyi Mou, Jingyuan Yang, Zeheng Qian et al. · Zhejiang University · The University of Sydney +2 more

Jailbreaks LLMs by spatially redistributing tokens across rows/columns/diagonals, bypassing guardrails including OpenAI Moderation API at >75% ASR

Prompt Injection nlp

PDF Code

defense arXiv Jan 13, 2026 · Jan 2026

Q-realign: Piggybacking Realignment on Quantization for Safe and Efficient LLM Deployment

Qitao Tan, Xiaoying Song, Ningxi Cheng et al. · University of Georgia · University of North Texas +2 more

Recovers LLM safety alignment eroded by fine-tuning via post-training quantization, without retraining, in 40 minutes on one GPU

Transfer Learning Attack Prompt Injection nlp

PDF Code

benchmark arXiv Jan 4, 2026 · Jan 2026

JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models

Junyu Liu, Zirui Li, Qian Niu et al. · Kyoto University · Hohai University +3 more

Benchmarks 27 LLMs against 50K+ multi-turn medical jailbreak conversations in Japanese, finding fine-tuned medical models are most vulnerable

Prompt Injection nlp

PDF

attack arXiv Nov 30, 2025 · Nov 2025

The Outline of Deception: Physical Adversarial Attacks on Traffic Signs Using Edge Patches

Haojie Ji, Te Hu, Haowen Li et al. · Beijing Information Science & Technology University · Hong Kong Polytechnic University

Proposes stealthy edge-aligned adversarial patches on traffic signs that fool classifiers while evading human visual detection

Input Manipulation Attack vision

PDF

attack arXiv Oct 16, 2025 · Oct 2025

Stealthy Dual-Trigger Backdoors: Attacking Prompt Tuning in LM-Empowered Graph Foundation Models

Xiaoyu Xue, Yuni Lai, Chenxi Huang et al. · Hong Kong Polytechnic University · Shanghai Jiao Tong University +1 more

Dual-trigger backdoor attack on LM-empowered graph foundation models exploiting unsecured prompt tuning via text and structural triggers

Model Poisoning Transfer Learning Attack graph nlp

PDF

defense arXiv Oct 7, 2025 · Oct 2025

Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?

Qingyu Yin, Chak Tou Leong, Linyi Yang et al. · Zhejiang University · Xiaohongshu Inc. +6 more

Reveals mechanistic cause of safety alignment failure in reasoning LLMs and proposes data-efficient alignment repair via refusal cliff data selection

Prompt Injection nlp

2 citations PDF Code

benchmark WWW Sep 23, 2025 · Sep 2025

MER-Inspector: Assessing model extraction risks from an attack-agnostic perspective

Xinwei Zhang, Haibo Hu, Qingqing Ye et al. · Hong Kong Polytechnic University · Ltd.

Proposes NTK-based theoretical metrics to quantify model extraction risk across architectures without assuming a specific attack strategy

Model Theft vision

4 citations PDF

attack arXiv Sep 17, 2025 · Sep 2025

A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness

Xuan Luo, Yue Wang, Zefeng He et al. · Harbin Institute of Technology · Hong Kong Polytechnic University +2 more

Jailbreaks LLMs by reframing harmful queries as educational learning questions, bypassing safety alignment on 22 models

Prompt Injection nlp

PDF Code

defense arXiv Sep 7, 2025 · Sep 2025

RetinaGuard: Obfuscating Retinal Age in Fundus Images for Biometric Privacy Preserving

Zhengquan Luo, Chi Liu, Dongfu Xiao et al. · City University of Macau · Monash University +1 more

Defends biometric privacy by generating adversarial GAN perturbations that blind black-box retinal age prediction models while preserving diagnostic image utility

Input Manipulation Attack vision

PDF

attack arXiv Aug 13, 2025 · Aug 2025

IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding

Junxian Li, Beining Xu, Simin Chen et al. · Shanghai Jiao Tong University · Columbia University +3 more

Multi-target backdoor attack on VLM visual grounding using dynamic text-conditioned UNet triggers to hijack object localization

Model Poisoning vision multimodal nlp

PDF Code

attack arXiv Aug 3, 2025 · Aug 2025

Are All Prompt Components Value-Neutral? Understanding the Heterogeneous Adversarial Robustness of Dissected Prompt in Large Language Models

Yujia Zheng, Tianhao Li, Haotian Huang et al. · Duke University · North China University of Technology +7 more

Attacks LLMs via component-wise text perturbations, revealing heterogeneous adversarial robustness across dissected prompt structures

Prompt Injection nlp

PDF Code

attack arXiv Jan 10, 2025 · Jan 2025

UV-Attack: Physical-World Adversarial Attacks for Person Detection via Dynamic-NeRF-based UV Mapping

Yanjie Li, Kaisheng Liang, Bin Xiao · Hong Kong Polytechnic University

Physical adversarial clothing attack on person detectors using dynamic NeRF UV mapping, achieving 92.7% evasion ASR across diverse poses

Input Manipulation Attack vision

PDF Code
