Latest papers

33 papers
survey arXiv Apr 30, 2026 · 21d ago

Security Attack and Defense Strategies for Autonomous Agent Frameworks: A Layered Review with OpenClaw as a Case Study

Luyao Xu, Xiang Chen · Nantong University · Nanjing University

Layered security review of LLM agent frameworks covering prompt injection, tool misuse, state persistence attacks, and ecosystem vulnerabilities

Prompt Injection · Insecure Plugin Design · Excessive Agency · nlp
PDF
defense arXiv Apr 30, 2026 · 21d ago

PuzzleMark: Implicit Jigsaw Learning for Robust Code Dataset Watermarking in Neural Code Completion Models

Haocheng Huang, Yuchen Chen, Weisong Sun et al. · Soochow University · Nanjing University +1 more

Dataset watermarking scheme embedding stealth marks in code via variable name patterns to prove training data ownership

Output Integrity Attack · nlp
PDF
attack arXiv Apr 30, 2026 · 21d ago

Secret Stealing Attacks on Local LLM Fine-Tuning through Supply-Chain Model Code Backdoors

Zi Li, Tian Zhou, Wenze Li et al. · Nanjing University

Malicious model code backdoors that hijack fine-tuning to force memorization and extraction of high-entropy secrets like API keys

AI Supply Chain Attacks · Model Inversion Attack · Model Poisoning · Sensitive Information Disclosure · nlp
PDF
defense arXiv Apr 24, 2026 · 27d ago

Train in Vain: Functionality-Preserving Poisoning to Prevent Unauthorized Use of Code Datasets

Yuan Xiao, Jiaming Wang, Yuchen Chen et al. · Nanjing University · University of New South Wales +3 more

Dataset poisoning defense that injects compilable, functionality-preserving code fragments to degrade CodeLLM training with only 10% contamination

Data Poisoning Attack · Training Data Poisoning · nlp
PDF
defense arXiv Apr 12, 2026 · 5w ago

DuCodeMark: Dual-Purpose Code Dataset Watermarking via Style-Aware Watermark-Poison Design

Yuchen Chen, Yuan Xiao, Chunrong Fang et al. · Nanjing University

Embeds ownership watermarks in code training datasets using AST-based style triggers plus poisoned samples that degrade model performance if watermark is removed

Output Integrity Attack · Model Poisoning · nlp
PDF
defense arXiv Apr 1, 2026 · 7w ago

Shapley-Guided Neural Repair Approach via Derivative-Free Optimization

Xinyu Sun, Wanwei Liu, Haoang Chi et al. · National University of Defense Technology · Nanjing University +1 more

Interpretable DNN repair using Shapley-guided fault localization and derivative-free optimization for backdoor removal, adversarial defense, and fairness

Input Manipulation Attack · Model Poisoning · vision
PDF
defense arXiv Mar 25, 2026 · 8w ago

Enhancing and Reporting Robustness Boundary of Neural Code Models for Intelligent Code Understanding

Tingxu Han, Wei Song, Weisong Sun et al. · Nanjing University · University of New South Wales +2 more

Black-box certified defense for code models using randomized smoothing to reduce adversarial attack success from 42% to 9.74%

Input Manipulation Attack · nlp
PDF
defense arXiv Mar 2, 2026 · 11w ago

Towards Privacy-Preserving LLM Inference via Collaborative Obfuscation (Technical Report)

Yu Lin, Qizhi Zhang, Wenqiang Ruan et al. · ByteDance · Nanjing University

Defends user input privacy in cloud LLM inference by obfuscating activations to resist internal state inversion attacks

Model Inversion Attack · Sensitive Information Disclosure · nlp
PDF
defense arXiv Feb 18, 2026 · Feb 2026

SRFed: Mitigating Poisoning Attacks in Privacy-Preserving Federated Learning with Heterogeneous Data

Yiwen Lu · Nanjing University

Defends federated learning against Byzantine poisoning and server-side gradient inference attacks using functional encryption and clustering-based aggregation

Data Poisoning Attack · Model Inversion Attack · federated-learning
PDF
defense arXiv Feb 12, 2026 · Feb 2026

Stop Tracking Me! Proactive Defense Against Attribute Inference Attack in LLMs

Dong Yan, Jian Liang, Ran He et al. · University of Chinese Academy of Sciences · Chinese Academy of Sciences +1 more

Defends against LLM attribute inference attacks using fine-grained anonymization and adversarial suffix optimization to induce model rejection

Sensitive Information Disclosure · nlp
1 citation PDF Code
attack arXiv Feb 6, 2026 · Feb 2026

Confundo: Learning to Generate Robust Poison for Practical RAG Systems

Haoyang Hu, Zhejun Jiang, Yueming Lyu et al. · The University of Hong Kong · Nanjing University +1 more

Fine-tunes an LLM as a poison generator to inject robust, chunking-aware malicious content into RAG knowledge bases

Data Poisoning Attack · Prompt Injection · nlp
PDF
defense arXiv Feb 1, 2026 · Feb 2026

Exposing and Defending the Achilles' Heel of Video Mixture-of-Experts

Songping Wang, Qinglong Liu, Yueming Lyu et al. · Nanjing University · Ltd. +1 more

Proposes component-level adversarial attacks and defenses targeting routers and experts in video MoE models

Input Manipulation Attack · vision
1 citation PDF
defense arXiv Feb 1, 2026 · Feb 2026

Who Transfers Safety? Identifying and Targeting Cross-Lingual Shared Safety Neurons

Xianhui Zhang, Chengyu Xie, Linxia Zhu et al. · Nanjing University of Science and Technology · National University of Singapore +2 more

Identifies sparse cross-lingual safety neurons in LLMs and proposes targeted fine-tuning to close multilingual jailbreak safety gaps

Prompt Injection · nlp
PDF Code
defense arXiv Jan 29, 2026 · Jan 2026

TraceRouter: Robust Safety for Large Foundation Models via Path-Level Intervention

Chuancheng Shi, Shangze Li, Wenjun Lu et al. · The University of Sydney · Nanjing University of Science and Technology +2 more

Defends LLMs, diffusion models, and MLLMs from jailbreaks by tracing and severing harmful semantic circuits via sparse autoencoders and causal path analysis

Input Manipulation Attack · Prompt Injection · nlp · vision · multimodal · generative
PDF
defense arXiv Jan 29, 2026 · Jan 2026

AtPatch: Debugging Transformers via Hot-Fixing Over-Attention

Shihao Weng, Yang Feng, Jincheng Li et al. · Nanjing University · Singapore Management University

Inference-time defense that neutralizes backdoor triggers in transformers by detecting and redistributing anomalous attention maps without modifying weights

Model Poisoning · vision · nlp
PDF
defense arXiv Jan 23, 2026 · Jan 2026

SafeThinker: Reasoning about Risk to Deepen Safety Beyond Shallow Alignment

Xianya Fang, Xianying Luo, Yadong Wang et al. · Nanjing University of Aeronautics and Astronautics · Tsinghua University +3 more

Adaptive three-stage LLM defense routes inputs by risk level to counter jailbreaks and prefilling attacks without sacrificing utility

Prompt Injection · nlp
PDF
attack arXiv Jan 18, 2026 · Jan 2026

Zero-Permission Manipulation: Can We Trust Large Multimodal Model Powered GUI Agents?

Yi Qian, Kunwei Qian, Xingbang He et al. · Nanjing University · Ltd +1 more

Attacks VLM-powered Android GUI agents by hijacking UI state between observation and action, achieving 100% success with zero permissions

Prompt Injection · Excessive Agency · multimodal
PDF
attack arXiv Dec 24, 2025 · Dec 2025

CoTDeceptor: Adversarial Code Obfuscation Against CoT-Enhanced LLM Code Agents

Haoyang Li, Mingjin Li, Jinxin Zuo et al. · Beijing University of Posts and Telecommunications · Chinese Academy of Sciences +3 more

Adversarial code obfuscation framework that exploits CoT reasoning chain weaknesses to evade LLM-based vulnerability detectors

Input Manipulation Attack · Prompt Injection · nlp
PDF Code
benchmark TrustCom Dec 17, 2025 · Dec 2025

Bits for Privacy: Evaluating Post-Training Quantization via Membership Inference

Chenxiang Zhang, Tongxi Qu, Zhong Li et al. · University of Luxembourg · Nanjing University

Evaluates how post-training quantization affects membership inference vulnerability, finding 1.58-bit models leak an order of magnitude less than full-precision

Membership Inference Attack · vision
PDF
attack arXiv Dec 10, 2025 · Dec 2025

Reference Recommendation based Membership Inference Attack against Hybrid-based Recommender Systems

Xiaoxiao Chi, Xuyun Zhang, Yan Wang et al. · Macquarie University · The University of Newcastle +1 more

Novel metric-based membership inference attack against hybrid recommender systems using reference recommendations to infer user training membership

Membership Inference Attack · tabular
PDF