Latest papers

5 papers
attack · arXiv · Apr 14, 2026

Compiling Activation Steering into Weights via Null-Space Constraints for Stealthy Backdoors

Rui Yin, Tianxu Han, Naen Xu et al. · Zhejiang University · Palo Alto Networks +3 more

Stealthy LLM backdoor injection via weight editing: activation steering is compiled into the model weights under null-space constraints to yield reliable jailbreaks

Model Poisoning · AI Supply Chain Attacks · Prompt Injection · nlp
benchmark · arXiv · Apr 9, 2026

ACIArena: Toward Unified Evaluation for Agent Cascading Injection

Hengyu An, Minxi Li, Jinghuai Zhang et al. · Zhejiang University · Tsinghua University +3 more

Benchmark framework for evaluating multi-agent LLM systems against cascading injection attacks across external inputs, profiles, and inter-agent messages

Prompt Injection · Excessive Agency · nlp · multimodal
attack · arXiv · Jan 9, 2026

HogVul: Black-box Adversarial Code Generation Framework Against LM-based Vulnerability Detectors

Jingxiao Yang, Ping He, Tianyu Du et al. · Zhejiang University · Ningbo Global Innovation Center +1 more

Black-box adversarial attack on LM-based vulnerability detectors that uses particle swarm optimization (PSO) to combine lexical and syntactic code perturbations

Input Manipulation Attack · nlp
defense · arXiv · Nov 13, 2025

Do Not Merge My Model! Safeguarding Open-Source LLMs Against Unauthorized Model Merging

Qinfeng Li, Miao Pan, Jintao Chen et al. · Zhejiang University · Ningbo Global Innovation Center +2 more

Defends open-source LLMs from unauthorized model merging by disrupting Linear Mode Connectivity between homologous model weights

Model Theft · nlp
1 citation
defense · arXiv · Nov 13, 2025

RAGFort: Dual-Path Defense Against Proprietary Knowledge Base Extraction in Retrieval-Augmented Generation

Qinfeng Li, Miao Pan, Ke Xiong et al. · Zhejiang University · Ant Group +3 more

Defends RAG systems against proprietary knowledge base extraction attacks using dual-path contrastive reindexing and constrained cascade generation

Sensitive Information Disclosure · nlp
Code available