Latest papers

16 papers
attack arXiv Mar 24, 2026 · 13d ago

AgentRAE: Remote Action Execution through Notification-based Visual Backdoors against Screenshots-based Mobile GUI Agents

Yutao Luo, Haotian Zhu, Shuchao Pang et al. · Nanjing University of Science and Technology · Macquarie University +3 more

Backdoor attack on mobile GUI agents using benign notification icons to trigger malicious actions with 90%+ success rate

Model Poisoning vision multimodal
PDF
attack arXiv Feb 26, 2026 · 5w ago

Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search

Xun Huang, Simeng Qin, Xiaoshuang Jia et al. · Nanyang Technological University · BraneMatrix AI +7 more

Bio-inspired optimization generates classical Chinese jailbreak prompts that defeat modern-language safety guardrails in black-box LLMs

Prompt Injection nlp
PDF
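As a rough illustration of the bio-inspired search this summary describes, here is a minimal evolutionary loop over prompt candidates; the judge `score_fn`, the mutation operator `mutate_fn`, and the population sizes are assumptions for illustration, not the paper's actual pipeline.

```python
import random

def evolve_prompts(seed_prompts, score_fn, mutate_fn,
                   generations=20, pop_size=16, elite=4):
    """Evolutionary search over jailbreak prompt candidates.

    score_fn(prompt) -> float : attacker-side judge, higher = more likely to bypass.
    mutate_fn(prompt) -> str  : e.g. rewrite a clause into Classical Chinese.
    """
    population = list(seed_prompts)
    for _ in range(generations):
        ranked = sorted(population, key=score_fn, reverse=True)
        parents = ranked[:elite]                        # keep the strongest candidates
        children = [mutate_fn(random.choice(parents))   # refill the population by mutation
                    for _ in range(pop_size - elite)]
        population = parents + children
    return max(population, key=score_fn)
```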
defense arXiv Feb 1, 2026 · 9w ago

Who Transfers Safety? Identifying and Targeting Cross-Lingual Shared Safety Neurons

Xianhui Zhang, Chengyu Xie, Linxia Zhu et al. · Nanjing University of Science and Technology · National University of Singapore +2 more

Identifies sparse cross-lingual safety neurons in LLMs and proposes targeted fine-tuning to close multilingual jailbreak safety gaps

Prompt Injection nlp
PDF Code
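A minimal PyTorch sketch of the general recipe implied here: rank neurons by their activation gap on harmful versus benign prompts (pooled across languages), then restrict fine-tuning updates to those neurons with a gradient mask. The helper names and the activation-gap criterion are assumptions, not the paper's exact method.

```python
import torch

def find_shared_safety_neurons(act_harmful, act_benign, top_k=100):
    """Rank neurons by mean activation gap between harmful and benign prompts.

    act_harmful / act_benign: (num_prompts, hidden_dim) activations,
    e.g. MLP outputs averaged over tokens and pooled across languages.
    Returns the indices of the top_k most safety-relevant neurons.
    """
    gap = (act_harmful.mean(0) - act_benign.mean(0)).abs()
    return torch.topk(gap, top_k).indices

def restrict_finetuning_to_neurons(layer, neuron_idx):
    """Mask the weight gradient so optimisation only updates the selected neurons
    (rows of a Linear layer's weight matrix)."""
    mask = torch.zeros(layer.weight.shape[0], dtype=torch.bool,
                       device=layer.weight.device)
    mask[neuron_idx] = True
    layer.weight.register_hook(lambda g: g * mask.unsqueeze(1).to(g.dtype))
```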
benchmark arXiv Jan 30, 2026 · 9w ago

Lingua-SafetyBench: A Benchmark for Safety Evaluation of Multilingual Vision-Language Models

Enyi Shi, Pengyang Shao, Yanxin Zhang et al. · Nanjing University of Science and Technology · National University of Singapore +3 more

Multilingual multimodal safety benchmark revealing cross-lingual asymmetries in VLLM jailbreak susceptibility across 10 languages and 11 models

Prompt Injection multimodal nlp
PDF Code
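A generic evaluation loop of the kind such a benchmark enables: per-language jailbreak success rate under a response judge. The `benchmark` dict layout and the `model.generate` and `judge` interfaces are hypothetical, for illustration only.

```python
def evaluate_multilingual_safety(model, benchmark, judge):
    """Per-language jailbreak success rate over a multilingual safety benchmark.

    benchmark: dict mapping language code to a list of harmful prompts, e.g. {"en": [...], "zh": [...]}.
    judge(response) -> 1 if the response is unsafe, else 0.
    """
    results = {}
    for lang, prompts in benchmark.items():
        unsafe = sum(judge(model.generate(p)) for p in prompts)
        results[lang] = unsafe / len(prompts)   # attack success rate for this language
    return results
```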
defense arXiv Jan 29, 2026 · 9w ago

TraceRouter: Robust Safety for Large Foundation Models via Path-Level Intervention

Chuancheng Shi, Shangze Li, Wenjun Lu et al. · The University of Sydney · Nanjing University of Science and Technology +2 more

Defends LLMs, diffusion models, and MLLMs from jailbreaks by tracing and severing harmful semantic circuits via sparse autoencoders and causal path analysis

Input Manipulation Attack Prompt Injection nlp vision multimodal generative
PDF
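An illustrative sketch of path-level intervention with a sparse autoencoder: encode the residual-stream activation into SAE features, zero the features flagged as harmful, and decode back. The SAE interfaces and the assumption that harmful feature indices are already identified are simplifications, not TraceRouter's actual pipeline.

```python
import torch

def ablate_harmful_features(hidden, sae_encoder, sae_decoder, harmful_idx):
    """Project an activation into SAE feature space, sever the flagged features,
    and reconstruct the activation.

    hidden: (batch, d_model); sae_encoder / sae_decoder: the SAE's linear maps;
    harmful_idx: feature indices flagged by causal path analysis (assumed given).
    """
    feats = torch.relu(sae_encoder(hidden))   # sparse feature activations
    feats[:, harmful_idx] = 0.0               # cut the harmful semantic path
    return sae_decoder(feats)                 # rewritten residual-stream activation
```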
tool arXiv Dec 22, 2025 · Dec 2025

DREAM: Dynamic Red-teaming across Environments for AI Models

Liming Lu, Xiang Gu, Junyu Huang et al. · Nanjing University of Science and Technology · The University of Hong Kong +3 more

Automated red-teaming tool for LLM agents that chains 1,986 atomic attacks across 349 environments, achieving 70%+ bypass rates

Prompt Injection Excessive Agency nlp
PDF
defense arXiv Dec 8, 2025 · Dec 2025

Amulet: Fast TEE-Shielded Inference for On-Device Model Protection

Zikai Mao, Lingchen Zhao, Lei Xu et al. · Wuhan University · Nanjing University of Science and Technology +1 more

Defends on-device ML model weights from extraction using TEE obfuscation, enabling GPU-accelerated inference with only 2 TEE interactions per request

Model Theft vision nlp
PDF
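A toy example of the general TEE-shielded idea (additive weight masking for a single linear layer, with the mask held only inside the TEE). This is not Amulet's actual construction, just a sketch of how an untrusted GPU can compute on obfuscated weights while the TEE corrects the result.

```python
import numpy as np

rng = np.random.default_rng(0)

# Model owner: obfuscate the weight matrix before deploying it outside the TEE.
W = rng.standard_normal((512, 1024))      # secret weights
M = rng.standard_normal(W.shape)          # random mask, kept only inside the TEE
W_obf = W + M                             # obfuscated weights handed to the GPU

def gpu_linear(x):
    """Untrusted GPU computes on obfuscated weights only."""
    return W_obf @ x

def tee_correct(x, y_obf):
    """TEE removes the mask's contribution: (W + M)x - Mx = Wx."""
    return y_obf - M @ x

x = rng.standard_normal(1024)
assert np.allclose(tee_correct(x, gpu_linear(x)), W @ x)
```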
defense arXiv Nov 26, 2025 · Nov 2025

Multimodal Robust Prompt Distillation for 3D Point Cloud Models

Xiang Gu, Liming Lu, Xu Zheng et al. · Nanjing University of Science and Technology · The Hong Kong University of Science and Technology (Guangzhou) +3 more

Defends 3D point cloud models against adversarial attacks via multimodal teacher-student prompt distillation with zero inference overhead

Input Manipulation Attack vision multimodal
PDF Code
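For context, a standard robust knowledge-distillation objective of the teacher-student kind this summary refers to: soft targets from the teacher on adversarial point clouds blended with hard-label cross-entropy on the student. The temperature, blend weight, and loss form are generic textbook choices, not the paper's exact loss.

```python
import torch.nn.functional as F

def robust_distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Robust KD objective: KL to the teacher's softened distribution plus CE to labels."""
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```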
attack arXiv Oct 3, 2025 · Oct 2025

Untargeted Jailbreak Attack

Xinzhe Huang, Wenjing Hu, Tianhang Zheng et al. · Zhejiang University · Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security +3 more

Gradient-based untargeted jailbreak attack maximizes LLM unsafety probability without fixed response targets, achieving 80% ASR in 100 iterations

Input Manipulation Attack Prompt Injection nlp
2 citations PDF Code
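A hedged sketch of what an untargeted, gradient-guided jailbreak step can look like: a GCG-style coordinate substitution, scored against an unsafety probability rather than a fixed target string. The `unsafety_loss_fn` interface and the single-position substitution rule are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def untargeted_jailbreak_step(suffix_ids, vocab_size, unsafety_loss_fn, top_k=8):
    """One coordinate step on an adversarial suffix.

    unsafety_loss_fn(one_hot) -> scalar loss, lower = higher P(unsafe response);
    it should embed the one-hot suffix with the victim model's embedding table.
    """
    one_hot = F.one_hot(suffix_ids, vocab_size).float()
    one_hot.requires_grad_(True)
    unsafety_loss_fn(one_hot).backward()
    # Linearised loss change for swapping in each vocabulary token at each position:
    # the most negative gradients are the most promising substitutions.
    candidates = (-one_hot.grad).topk(top_k, dim=-1).indices    # (suffix_len, top_k)
    pos = torch.randint(len(suffix_ids), (1,)).item()
    new_ids = suffix_ids.clone()
    new_ids[pos] = candidates[pos, torch.randint(top_k, (1,)).item()]
    return new_ids
```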
defense arXiv Sep 25, 2025 · Sep 2025

FERD: Fairness-Enhanced Data-Free Robustness Distillation

Zhengxiao Li, Liming Lu, Xu Zheng et al. · Nanjing University of Science and Technology · HKUST(GZ) +3 more

Fairness-enhanced data-free distillation reduces per-class adversarial robustness disparity in student models via reweighted synthetic adversarial examples

Input Manipulation Attack vision
PDF
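One plausible way to realize the described reweighting, shown as a sketch: upweight the loss of classes whose running robust accuracy trails the mean. The softmax weighting and temperature are assumptions, not FERD's exact scheme.

```python
import torch

def fairness_reweighted_loss(per_sample_loss, labels, per_class_robust_acc, temperature=1.0):
    """Upweight classes whose adversarial robustness lags behind the mean.

    per_sample_loss: (batch,) unreduced losses on (synthetic) adversarial examples.
    per_class_robust_acc: (num_classes,) running robust accuracy per class.
    """
    gap = per_class_robust_acc.mean() - per_class_robust_acc        # positive for weak classes
    class_weights = torch.softmax(gap / temperature, dim=0) * gap.numel()
    return (class_weights[labels] * per_sample_loss).mean()
```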
defense arXiv Sep 24, 2025 · Sep 2025

SafeSteer: Adaptive Subspace Steering for Efficient Jailbreak Defense in Vision-Language Models

Xiyu Zeng, Siyuan Liang, Liming Lu et al. · Nanjing University of Science and Technology · Nanyang Technological University +1 more

Inference-time SVD-based activation steering defends VLMs against visual jailbreaks while preserving utility and efficiency

Input Manipulation Attack Prompt Injection vision nlp multimodal
1 citation PDF
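A minimal sketch of SVD-based activation steering in this spirit: estimate a low-rank "unsafe" subspace from jailbreak-minus-benign activation differences, then project it out of hidden states at inference. The rank, strength, and difference construction are illustrative choices, not SafeSteer's exact procedure.

```python
import torch

def build_steering_subspace(acts_jailbreak, acts_benign, rank=4):
    """Estimate a low-rank unsafe-activation subspace via SVD of paired differences.

    acts_jailbreak / acts_benign: (n_pairs, d_model) hidden states collected on
    visual-jailbreak and benign inputs respectively.
    """
    diffs = acts_jailbreak - acts_benign
    _, _, Vh = torch.linalg.svd(diffs, full_matrices=False)
    return Vh[:rank]                                # (rank, d_model) orthonormal directions

def steer(hidden, subspace, strength=1.0):
    """At inference, remove the component of the activation lying in the unsafe subspace."""
    proj = (hidden @ subspace.T) @ subspace
    return hidden - strength * proj
```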
defense arXiv Sep 16, 2025 · Sep 2025

CIARD: Cyclic Iterative Adversarial Robustness Distillation

Liming Lu, Shuchao Pang, Xu Zheng et al. · Nanjing University of Science and Technology · HKUST(GZ) +4 more

Defends lightweight student models against adversarial attacks via cyclic multi-teacher distillation with contrastive alignment and continuous adversarial retraining

Input Manipulation Attack vision
PDF Code
attack arXiv Aug 21, 2025 · Aug 2025

Towards a 3D Transfer-based Black-box Attack via Critical Feature Guidance

Shuchao Pang, Zhenghan Chen, Shen Zhang et al. · Nanjing University of Science and Technology · Microsoft +2 more

Transfer-based black-box adversarial attack on 3D point clouds by corrupting shared critical features across DNN architectures

Input Manipulation Attack vision
PDF Code
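An illustrative PGD-style step guided by critical features of a surrogate model: perturb the point cloud so as to suppress feature channels assumed to be shared across architectures. The `surrogate.features` interface and the squared-magnitude loss are assumptions, not the paper's exact objective.

```python
import torch

def critical_feature_attack_step(points, surrogate, critical_mask,
                                 step=0.01, eps=0.05, orig=None):
    """One signed-gradient step that corrupts the surrogate's critical features.

    points: (N, 3) point cloud with requires_grad=True.
    surrogate.features(points) -> (C,) feature vector of the surrogate DNN.
    critical_mask: (C,) 0/1 mask marking the shared critical channels.
    """
    feats = surrogate.features(points)
    loss = (feats * critical_mask).pow(2).sum()     # drive critical features toward zero
    grad, = torch.autograd.grad(loss, points)
    adv = points - step * grad.sign()
    if orig is not None:
        adv = orig + (adv - orig).clamp(-eps, eps)  # keep the perturbation bounded
    return adv.detach().requires_grad_(True)
```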
defense arXiv Aug 17, 2025 · Aug 2025

Semantic Discrepancy-aware Detector for Image Forgery Identification

Ziye Wang, Minghang Yu, Chunyan Xu et al. · Nanjing University of Science and Technology · Beijing Normal University

Vision-language model-guided detector aligns forgery traces with semantic concepts to identify AI-generated forged images

Output Integrity Attack vision
PDF Code
defense arXiv Aug 6, 2025 · Aug 2025

Isolate Trigger: Detecting and Eliminating Adaptive Backdoor Attacks

Chengrui Sun, Hua Zhang, Haoran Gao et al. · Beijing University of Posts and Telecommunications · China Mobile Research Institute +2 more

Defends against adaptive backdoor attacks by isolating hidden triggers from benign features and applying unlearning-based model repair

Model Poisoning vision
PDF
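A rough sketch of unlearning-based repair once triggered samples have been isolated: ascend the loss on the isolated backdoor batch while descending it on clean data. The single combined objective and the weighting `gamma` are illustrative, not the paper's actual repair procedure.

```python
import torch
import torch.nn.functional as F

def unlearning_repair_step(model, optimizer, triggered_batch, clean_batch, gamma=0.5):
    """One repair step: preserve clean accuracy, unlearn the isolated triggered samples."""
    x_trig, y_trig = triggered_batch    # samples containing the isolated trigger
    x_clean, y_clean = clean_batch      # held-out benign samples
    loss = (F.cross_entropy(model(x_clean), y_clean)
            - gamma * F.cross_entropy(model(x_trig), y_trig))   # gradient ascent on the backdoor
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```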
defense arXiv Aug 1, 2025 · Aug 2025

DBLP: Noise Bridge Consistency Distillation For Efficient And Reliable Adversarial Purification

Chihan Huang, Belal Alsinglawi, Islam Al-qudah · Zayed University · Nanjing University of Science and Technology +1 more

Distills diffusion purification into a latent consistency model enabling real-time adversarial input cleaning with SOTA robustness

Input Manipulation Attack vision
PDF
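A hedged sketch of one-step purification with a latent consistency model: noise the adversarial input's latent to wash out the perturbation, then denoise in a single consistency step. The `encoder` / `consistency_model` / `decoder` interfaces and the noise level are assumptions for illustration, not DBLP's actual components.

```python
import torch

@torch.no_grad()
def purify(x_adv, encoder, consistency_model, decoder, noise_level=0.25):
    """Single-step adversarial purification in latent space."""
    z = encoder(x_adv)                                  # move the input to latent space
    z_noisy = z + noise_level * torch.randn_like(z)     # drown out the adversarial perturbation
    z_clean = consistency_model(z_noisy, noise_level)   # one-step consistency denoising
    return decoder(z_clean)                             # purified image for the downstream classifier
```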