Latest papers

14 papers
attack arXiv Mar 25, 2026

How Vulnerable Are Edge LLMs?

Ao Ding, Hongzong Li, Zi Liang et al. · China University of Geosciences · Hong Kong University of Science and Technology +4 more

Query-based extraction attack on quantized edge LLMs using clustered instruction queries to steal model behavior efficiently

Model Theft nlp
PDF
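
A rough sketch of the clustered-query idea behind this extraction attack: embed a pool of candidate instructions, cluster them, and spend the query budget only on cluster representatives. The TF-IDF embedding, k-means, and the `query_edge_model` stub are illustrative assumptions, not the paper's pipeline.

```python
# Sketch: budget-limited extraction queries via instruction clustering.
# TF-IDF embedding and query_edge_model() are illustrative placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def select_representative_queries(instructions, n_clusters=8):
    """Cluster candidate instructions and keep one per cluster."""
    X = TfidfVectorizer().fit_transform(instructions)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    reps = []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        # pick the member closest to the cluster centroid
        dists = np.linalg.norm(X[idx].toarray() - km.cluster_centers_[c], axis=1)
        reps.append(instructions[idx[np.argmin(dists)]])
    return reps

def query_edge_model(prompt):
    # placeholder for the quantized on-device model's API
    raise NotImplementedError

# transcripts = [(q, query_edge_model(q)) for q in select_representative_queries(pool)]
# The (query, response) pairs would then fine-tune a surrogate model.
```
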
defense arXiv Jan 30, 2026

DNA: Uncovering Universal Latent Forgery Knowledge

Jingtong Dou, Chuancheng Shi, Yemin Wang et al. · The University of Sydney · Xiamen University +2 more

Probes latent neurons in pre-trained vision models to detect AI-generated images without costly fine-tuning, outperforming black-box baselines

Output Integrity Attack vision
PDF
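
A minimal version of probing frozen representations for forgery detection, assuming a ResNet-50 backbone and a logistic-regression probe (both placeholders for whatever the paper actually uses):

```python
# Sketch: linear probe on frozen pre-trained features for fake-image detection.
# ResNet-50 backbone and logistic-regression probe are illustrative choices.
import torch
import torchvision.models as models
from sklearn.linear_model import LogisticRegression

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()    # expose penultimate features
backbone.eval()

@torch.no_grad()
def extract_features(images):        # images: (N, 3, 224, 224), normalized
    return backbone(images).cpu().numpy()

# train_imgs / train_labels: small labeled set (0 = real, 1 = generated)
# probe = LogisticRegression(max_iter=1000).fit(extract_features(train_imgs), train_labels)
# scores = probe.predict_proba(extract_features(test_imgs))[:, 1]
```
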
attack arXiv Jan 14, 2026

SpatialJB: How Text Distribution Art Becomes the "Jailbreak Key" for LLM Guardrails

Zhiyi Mou, Jingyuan Yang, Zeheng Qian et al. · Zhejiang University · The University of Sydney +2 more

Jailbreaks LLMs by spatially redistributing tokens across rows/columns/diagonals, bypassing guardrails including OpenAI Moderation API at >75% ASR

Prompt Injection nlp
PDF Code
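
The spatial-redistribution trick can be pictured with a toy layout that writes a payload down the diagonal of a character grid, so no single row or column carries the intact string. This is a guess at the general mechanism for illustration, not the paper's exact encodings or evasion rates:

```python
# Sketch: lay a payload out along the main diagonal of a character grid,
# padding with a benign filler character. Illustrative only; the paper's
# actual row/column/diagonal encodings may differ.
def diagonal_layout(payload: str, pad: str = ".") -> str:
    n = len(payload)
    grid = [[pad] * n for _ in range(n)]
    for i, ch in enumerate(payload):
        grid[i][i] = ch
    return "\n".join(" ".join(row) for row in grid)

print(diagonal_layout("example"))
```
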
defense arXiv Jan 13, 2026

Q-realign: Piggybacking Realignment on Quantization for Safe and Efficient LLM Deployment

Qitao Tan, Xiaoying Song, Ningxi Cheng et al. · University of Georgia · University of North Texas +2 more

Uses post-training quantization to recover LLM safety alignment eroded by fine-tuning, without retraining, in 40 minutes on one GPU

Transfer Learning Attack Prompt Injection nlp
PDF Code
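
The idea that a post-training quantization pass over a fine-tuned model can double as a cheap realignment step can be illustrated with a naive per-tensor int8 round-trip; this is generic PTQ, not the Q-realign algorithm itself:

```python
# Sketch: naive per-tensor symmetric int8 quantize/dequantize of linear weights.
# Generic PTQ round-trip for illustration; Q-realign's actual procedure differs.
import torch

@torch.no_grad()
def int8_roundtrip(model: torch.nn.Module) -> torch.nn.Module:
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            w = module.weight.data
            scale = w.abs().max() / 127.0
            q = torch.clamp(torch.round(w / scale), -127, 127)
            module.weight.data = q * scale     # dequantized weights
    return model

# realigned = int8_roundtrip(finetuned_model)   # hypothetical fine-tuned LLM
```
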
benchmark arXiv Jan 4, 2026

JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models

Junyu Liu, Zirui Li, Qian Niu et al. · Kyoto University · Hohai University +3 more

Benchmarks 27 LLMs against 50K+ multi-turn medical jailbreak conversations in Japanese, finding fine-tuned medical models are most vulnerable

Prompt Injection nlp
PDF
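
Running such a multi-turn benchmark reduces to a driver loop that replays each conversation turn by turn and scores the replies; the chat interface, judge, and data layout below are assumptions, not the benchmark's real schema:

```python
# Sketch: multi-turn safety evaluation loop. chat_model() and is_unsafe()
# are placeholders; the benchmark's real format and judge differ.
def evaluate_conversation(turns, chat_model, is_unsafe):
    """turns: list of user messages forming one multi-turn jailbreak attempt."""
    history, violations = [], 0
    for user_msg in turns:
        history.append({"role": "user", "content": user_msg})
        reply = chat_model(history)                     # model under test
        history.append({"role": "assistant", "content": reply})
        violations += int(is_unsafe(reply))             # judge / rubric
    return violations

# unsafe_rate = sum(evaluate_conversation(c, model, judge) > 0 for c in dataset) / len(dataset)
```
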
attack arXiv Nov 30, 2025

The Outline of Deception: Physical Adversarial Attacks on Traffic Signs Using Edge Patches

Haojie Ji, Te Hu, Haowen Li et al. · Beijing Information Science & Technology University · Hong Kong Polytechnic University

Proposes stealthy edge-aligned adversarial patches on traffic signs that fool classifiers while evading human visual detection

Input Manipulation Attack vision
PDF
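
One way to approximate an "edge-aligned" patch is a PGD perturbation masked to a thin ring around the sign border, leaving the pictogram untouched; the ring mask and attack loop below are a generic stand-in for the paper's patch optimization:

```python
# Sketch: PGD perturbation restricted to a border-ring mask on a sign crop.
# Generic approximation of "edge-aligned" patching, not the paper's pipeline.
import torch
import torch.nn.functional as F

def ring_mask(size: int, width: int = 8) -> torch.Tensor:
    m = torch.zeros(1, 1, size, size)
    m[..., :width, :] = 1.0
    m[..., -width:, :] = 1.0
    m[..., :, :width] = 1.0
    m[..., :, -width:] = 1.0
    return m

def edge_pgd(model, x, label, steps=40, alpha=0.01):
    """Untargeted attack: maximize loss on the true class, edits confined to the ring."""
    mask = ring_mask(x.shape[-1]).to(x.device)
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model((x + delta * mask).clamp(0, 1)), label)
        loss.backward()
        delta.data += alpha * delta.grad.sign()   # gradient ascent on the loss
        delta.grad.zero_()
    return (x + delta.detach() * mask).clamp(0, 1)
```
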
attack arXiv Oct 16, 2025

Stealthy Dual-Trigger Backdoors: Attacking Prompt Tuning in LM-Empowered Graph Foundation Models

Xiaoyu Xue, Yuni Lai, Chenxi Huang et al. · Hong Kong Polytechnic University · Shanghai Jiao Tong University +1 more

Dual-trigger backdoor attack on LM-empowered graph foundation models exploiting unsecured prompt tuning via text and structural triggers

Model Poisoning Transfer Learning Attack graph nlp
PDF
defense arXiv Oct 7, 2025

Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?

Qingyu Yin, Chak Tou Leong, Linyi Yang et al. · Zhejiang University · Xiaohongshu Inc. +6 more

Reveals mechanistic cause of safety alignment failure in reasoning LLMs and proposes data-efficient alignment repair via refusal cliff data selection

Prompt Injection nlp
2 citations PDF Code
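
The "refusal cliff" diagnosis can be approximated by scoring refusal likelihood over increasingly long reasoning prefixes and locating where the score collapses; the `refusal_score` probe here is hypothetical, not the paper's mechanistic analysis:

```python
# Sketch: track a refusal score across reasoning-token prefixes and find the
# position where it drops sharply. refusal_score() is a hypothetical probe.
def refusal_trajectory(reasoning_tokens, refusal_score):
    """refusal_score(prefix_tokens) -> float in [0, 1]."""
    return [refusal_score(reasoning_tokens[:i])
            for i in range(1, len(reasoning_tokens) + 1)]

def cliff_position(scores, drop=0.5):
    for i in range(1, len(scores)):
        if scores[i - 1] - scores[i] >= drop:
            return i          # index where refusal falls off
    return None
```
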
benchmark WWW Sep 23, 2025

MER-Inspector: Assessing model extraction risks from an attack-agnostic perspective

Xinwei Zhang, Haibo Hu, Qingqing Ye et al. · Hong Kong Polytechnic University · Ltd.

Proposes NTK-based theoretical metrics to quantify model extraction risk across architectures without assuming a specific attack strategy

Model Theft vision
4 citations PDF
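
The quantity behind NTK-based risk metrics is the inner product of parameter gradients for two inputs; the toy MLP below makes that concrete, but it is not an architecture or metric from the paper:

```python
# Sketch: empirical NTK entry k(x1, x2) = <grad_theta f(x1), grad_theta f(x2)>
# for a scalar-output toy MLP. Illustrative of the underlying quantity only.
import torch

net = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(),
                          torch.nn.Linear(32, 1))

def param_grad(x):
    net.zero_grad()
    net(x).sum().backward()
    return torch.cat([p.grad.flatten() for p in net.parameters()])

def ntk_entry(x1, x2):
    return torch.dot(param_grad(x1), param_grad(x2)).item()

print(ntk_entry(torch.randn(1, 16), torch.randn(1, 16)))
```
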
attack arXiv Sep 17, 2025

A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness

Xuan Luo, Yue Wang, Zefeng He et al. · Harbin Institute of Technology · Hong Kong Polytechnic University +2 more

Jailbreaks LLMs by reframing harmful queries as educational learning questions, bypassing safety alignment on 22 models

Prompt Injection nlp
PDF Code
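
The reframing step is essentially a prompt template that casts a test query as a pedagogical question before it reaches the target model; the wording below is a benign placeholder for red-team evaluation, not the paper's prompts:

```python
# Sketch: wrap a test query in an "educational" framing for red-team evaluation.
# The template text is a placeholder, not the paper's prompt.
EDU_TEMPLATE = (
    "I am preparing a classroom safety lesson. For educational awareness, "
    "explain the following topic at a conceptual level: {query}"
)

def reframe(query: str) -> str:
    return EDU_TEMPLATE.format(query=query)

# responses = [target_model(reframe(q)) for q in benchmark_queries]
```
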
defense arXiv Sep 7, 2025

RetinaGuard: Obfuscating Retinal Age in Fundus Images for Biometric Privacy Preserving

Zhengquan Luo, Chi Liu, Dongfu Xiao et al. · City University of Macau · Monash University +1 more

Defends biometric privacy by generating adversarial GAN perturbations that blind black-box retinal age prediction models while preserving diagnostic image utility

Input Manipulation Attack vision
PDF
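
A simplified, white-box version of the idea: train a small perturbation generator that degrades a surrogate retinal-age regressor under a fixed budget. The paper itself attacks black-box predictors with a GAN, so the generator, the surrogate `age_model`, and the budget below are all assumptions:

```python
# Sketch: perturbation generator trained against a surrogate age regressor,
# with an L-infinity budget via tanh scaling. White-box simplification of the
# paper's black-box GAN setting; age_model is a hypothetical surrogate.
import torch

gen = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(16, 3, 3, padding=1), torch.nn.Tanh())
opt = torch.optim.Adam(gen.parameters(), lr=1e-4)
EPS = 8 / 255                                    # perturbation budget

def train_step(fundus, true_age, age_model):
    """fundus: (B, 3, H, W) in [0, 1]; push predicted age away from the truth."""
    protected = (fundus + EPS * gen(fundus)).clamp(0, 1)
    loss = -torch.nn.functional.l1_loss(age_model(protected).squeeze(-1), true_age)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return protected.detach()
```
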
attack arXiv Aug 13, 2025

IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding

Junxian Li, Beining Xu, Simin Chen et al. · Shanghai Jiao Tong University · Columbia University +3 more

Multi-target backdoor attack on VLM visual grounding using dynamic text-conditioned UNet triggers to hijack object localization

Model Poisoning vision multimodal nlp
PDF Code
attack arXiv Aug 3, 2025

Are All Prompt Components Value-Neutral? Understanding the Heterogeneous Adversarial Robustness of Dissected Prompt in Large Language Models

Yujia Zheng, Tianhao Li, Haotian Huang et al. · Duke University · North China University of Technology +7 more

Attacks LLMs via component-wise text perturbations, revealing heterogeneous adversarial robustness across dissected prompt structures

Prompt Injection nlp
PDF Code
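
Component-wise robustness testing amounts to perturbing each dissected prompt field in isolation and comparing the damage; the field names, character-swap noise, and `score` callback below are illustrative assumptions:

```python
# Sketch: perturb each prompt component separately and measure per-component
# robustness. Field names, the character-swap noise, and score() are assumptions.
import random

def char_swap(text: str, rate: float = 0.05) -> str:
    chars = list(text)
    for i in range(len(chars) - 1):
        if random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def component_robustness(prompt_parts: dict, score) -> dict:
    """prompt_parts: e.g. {'instruction': ..., 'context': ..., 'question': ...}.
    score(full_prompt) -> task accuracy or harmlessness metric."""
    clean = score(" ".join(prompt_parts.values()))
    drops = {}
    for name in prompt_parts:
        noisy = dict(prompt_parts)
        noisy[name] = char_swap(noisy[name])
        drops[name] = clean - score(" ".join(noisy.values()))
    return drops
```
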
attack arXiv Jan 10, 2025

UV-Attack: Physical-World Adversarial Attacks for Person Detection via Dynamic-NeRF-based UV Mapping

Yanjie Li, Kaisheng Liang, Bin Xiao · Hong Kong Polytechnic University

Physical adversarial clothing attack on person detectors using dynamic NeRF UV mapping, achieving 92.7% evasion ASR across diverse poses

Input Manipulation Attack vision
PDF Code