Latest papers

23 papers
defense arXiv Apr 28, 2026

SnapGuard: Lightweight Prompt Injection Detection for Screenshot-Based Web Agents

Mengyao Du, Han Fang, Haokai Ma et al. · National University of Defense Technology · University of Science and Technology of China +2 more

Lightweight detector that identifies prompt injection attacks in web agent screenshots using visual gradient analysis and text recovery

Prompt Injection · Excessive Agency · multimodal · nlp
PDF
attack arXiv Apr 23, 2026

Toward Efficient Membership Inference Attacks against Federated Large Language Models: A Projection Residual Approach

Guilin Deng, Silong Chen, Yuchuan Luo et al. · National University of Defense Technology · City University of Hong Kong +1 more

Gradient-based membership inference attack on federated LLMs achieving near-perfect accuracy via projection residual analysis

Membership Inference Attack · nlp · federated-learning
PDF Code
defense arXiv Apr 14, 2026

Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints

Songping Peng, Zhiheng Zhang, Daojian Zeng et al. · Hunan Normal University · Chinese Academy of Sciences +1 more

Couples weight subspace constraints with activation regularization to prevent safety degradation during LLM fine-tuning

Prompt Injection · nlp
PDF
attack arXiv Apr 8, 2026

SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems

Yunhao Feng, Yifan Ding, Yingshui Tan et al. · National University of Defense Technology · Alibaba Group +2 more

Backdoor attack embedding encrypted malicious payloads in agent skills, activated by triggers during skill composition

Model Poisoning · Excessive Agency · nlp
PDF
attack arXiv Apr 3, 2026

Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models

Chengyin Hu, Yuxian Dong, Yikun Guo et al. · National University of Defense Technology

Universal physical adversarial patches that disrupt semantic alignment in infrared vision-language models across classification, captioning, and VQA tasks

Input Manipulation Attack · Prompt Injection · multimodal · vision
PDF
defense arXiv Apr 1, 2026

Shapley-Guided Neural Repair Approach via Derivative-Free Optimization

Xinyu Sun, Wanwei Liu, Haoang Chi et al. · National University of Defense Technology · Nanjing University +1 more

Interpretable DNN repair using Shapley-guided fault localization and derivative-free optimization for backdoor removal, adversarial defense, and fairness

Input Manipulation Attack · Model Poisoning · vision
PDF
defense arXiv Mar 16, 2026

BadLLM-TG: A Backdoor Defender powered by LLM Trigger Generator

Ruyi Zhang, Heng Gao, Songlei Jian et al. · National University of Defense Technology

LLM-powered trigger generator using reinforcement learning to detect and remove backdoors in NLP models via adversarial training

Model Poisoning · nlp
PDF Code
attack arXiv Mar 4, 2026

LEA: Label Enumeration Attack in Vertical Federated Learning

Wenhao Jiang, Shaojing Fu, Yuchuan Luo et al. · National University of Defense Technology

Infers private labels in vertical federated learning by enumerating label permutations and comparing gradient cosine similarity, without auxiliary data

Model Inversion Attack · federated-learning
PDF
defense arXiv Feb 6, 2026

TrapSuffix: Proactive Defense Against Adversarial Suffixes in Jailbreaking

Mengyao Du, Han Fang, Haokai Ma et al. · National University of Defense Technology · National University of Singapore +1 more

Proactive fine-tuning defense traps gradient-based jailbreak suffixes or fingerprints them, cutting LLM attack success below 0.01%

Input Manipulation Attack · Prompt Injection · nlp
PDF
attack arXiv Jan 24, 2026

Reconstructing Training Data from Adapter-based Federated Large Language Models

Silong Chen, Yuchuan Luo, Guilin Deng et al. · National University of Defense Technology · City University of Hong Kong

Gradient inversion attack reconstructs training text from LoRA adapter gradients in federated LLMs achieving ROUGE-1/2 over 99

Model Inversion Attack · Sensitive Information Disclosure · nlp · federated-learning
PDF Code
defense arXiv Jan 20, 2026

MirageNet: A Secure, Efficient, and Scalable On-Device Model Protection in Heterogeneous TEE and GPU System

Huadi Zheng, Li Cheng, Yan Ding · National University of Defense Technology

Defends edge-deployed DNN model IP from theft via TEE-GPU obfuscation, cutting overhead 16% versus GroupCover

Model Theft · vision
PDF
defense arXiv Jan 6, 2026

JPU: Bridging Jailbreak Defense and Unlearning via On-Policy Path Rectification

Xi Wang, Songlei Jian, Shasha Li et al. · National University of Defense Technology

Defends LLMs against jailbreaks by unlearning dynamic information paths that reassemble harmful outputs, not just isolated parameters

Prompt Injection · nlp
PDF
attack arXiv Dec 25, 2025

Exploring the Security Threats of Retriever Backdoors in Retrieval-Augmented Code Generation

Tian Li, Bo Lin, Shangwen Wang et al. · National University of Defense Technology

Backdoors RACG retrievers to inject vulnerable code into LLM context, achieving 40%+ vulnerable code generation while bypassing defenses

Model Poisoning · Prompt Injection · nlp · generative
PDF
attack arXiv Dec 16, 2025

Reasoning-Style Poisoning of LLM Agents via Stealthy Style Transfer: Process-Level Attacks and Runtime Monitoring in RSV Space

Xingfu Zhou, Pengfei Wang · National University of Defense Technology

Poisons LLM agent reasoning by style-transferring retrieved docs into pathological tones, bypassing content filters without altering facts

Prompt Injection · nlp
2 citations PDF
attack TDSC Dec 16, 2025

Optimizing the Adversarial Perturbation with a Momentum-based Adaptive Matrix

Wei Tao, Sheng Long, Xin Liu et al. · National University of Defense Technology · Academy of Military Science +3 more

AdaMI: momentum-based adaptive matrix attack that provably improves adversarial transferability over PGD and MI-FGSM across networks

Input Manipulation Attack · vision
PDF
defense medRxiv Dec 5, 2025

The Forgotten Shield: Safety Grafting in Parameter-Space for Medical MLLMs

Jiale Zhao, Xing Mou, Jinlin Wu et al. · National University of Defense Technology · Chinese Academy of Sciences +3 more

Defends Medical MLLMs against cross-modality jailbreaks by grafting safety knowledge from base models during fine-tuning via parameter-space intervention

Transfer Learning Attack · Prompt Injection · multimodal · vision · nlp
PDF
benchmark arXiv Nov 22, 2025

Beyond Jailbreak: Unveiling Risks in LLM Applications Arising from Blurred Capability Boundaries

Yunyi Zhang, Shibo Cui, Baojun Liu et al. · Tsinghua University · National University of Defense Technology +1 more

Discovers LLM apps routinely exceed intended capability boundaries, with 17 apps performing malicious tasks without any adversarial prompting

Excessive Agency · Prompt Injection · nlp
PDF
attack Chinese Conference on Pattern ... Nov 10, 2025

FoCLIP: A Feature-Space Misalignment Framework for CLIP-Based Image Manipulation and Detection

Yulin Chen, Zeyuan Wang, Tianyuan Yu et al. · National University of Defense Technology

Gradient-based adversarial framework fools CLIP image-quality metrics, then detects tampered images via grayscale color-channel sensitivity

Input Manipulation Attack · vision · multimodal
PDF
defense arXiv Oct 9, 2025

Provably Robust Adaptation for Language-Empowered Foundation Models

Yuni Lai, Xiaoyu Xue, Linghui Shen et al. · The Hong Kong Polytechnic University · National University of Defense Technology +2 more

Certifiably robust few-shot classifier for CLIP/GraphCLIP using trimmed-mean prototypes and randomized smoothing against support-set poisoning

Data Poisoning Attack · vision · graph · multimodal
1 citation PDF
attack arXiv Oct 1, 2025

Erased, But Not Forgotten: Erased Rectified Flow Transformers Still Remain Unsafe Under Concept Attack

Nanxiang Jiang, Zhaoxin Fan, Enhan Kang et al. · Beihang University · University of Science and Technology of China +3 more

Attacks concept erasure safety in Flux T2I models by exploiting attention localization, reactivating suppressed content via a 3.57 MB plug-and-play adapter

Input Manipulation Attack · vision · generative
1 citation PDF Code