ML Security Papers

Latest papers

5 papers

attack arXiv Apr 9, 2026

Wenpeng Xing, Moran Fang, Guangtai Wang et al. · Zhejiang University · Binjiang Institute of Zhejiang University +1 more

Inference-time jailbreak attack that surgically ablates safety guardrails by suppressing refusal-inducing activation patterns in LLM hidden states

Prompt Injection nlp

defense arXiv Apr 7, 2026

Haobo Zhang, Zhenhua Xu, Junxian Li et al. · Zhejiang University of Technology · Binjiang Institute of Zhejiang University +3 more

White-box LLM fingerprinting via differential attention patterns robust to fine-tuning, pruning, and merging for provenance verification

Model Theft nlp

benchmark arXiv Jan 26, 2026 · Jan 2026

Dezhang Kong, Zhuxi Wu, Shiqi Liu et al. · Zhejiang University · National University of Malaysia +4 more

Benchmark revealing LLM web agents fail to detect disguised malicious URLs across 61K attack instances in 10 real-world scenarios

Prompt Injection nlp

defense arXiv Jan 13, 2026 · Jan 2026

Zhenhua Xu, Yiran Zhao, Mengting Zhong et al. · Zhejiang University · Binjiang Institute of Zhejiang University +3 more

Hierarchical backdoor fingerprinting embeds nested stylistic and semantic triggers in LLMs to prove ownership against black-box theft

Model Theft nlp

3 citations

defense arXiv Aug 14, 2025 · Aug 2025

Wenpeng Xing, Zhonghao Qi, Yupeng Qin et al. · Zhejiang University · Binjiang Institute of Zhejiang University +3 more

Defends LLM-tool MCP interfaces from prompt injection and data exfiltration via a three-stage neural detection pipeline

Insecure Plugin Design Prompt Injection nlp