ML Security Papers

Latest papers

12 papers

attack arXiv Apr 9, 2026 · 6w ago

Silencing the Guardrails: Inference-Time Jailbreaking via Dynamic Contextual Representation Ablation

Wenpeng Xing, Moran Fang, Guangtai Wang et al. · Zhejiang University · Binjiang Institute of Zhejiang University +1 more

Inference-time jailbreak attack that surgically ablates safety guardrails by suppressing refusal-inducing activation patterns in LLM hidden states

Prompt Injection nlp

PDF

defense arXiv Jan 19, 2026 · Jan 2026

KinGuard: Hierarchical Kinship-Aware Fingerprinting to Defend Against Large Language Model Stealing

Zhenhua Xu, Xiaoning Tian, Wenjun Zeng et al. · Zhejiang University · GenTel.io +4 more

Defends LLM IP by embedding kinship-narrative knowledge into model weights for stealthy, robust ownership verification

Model Theft Model Theft nlp

PDF Code

defense arXiv Jan 13, 2026 · Jan 2026

DNF: Dual-Layer Nested Fingerprinting for Large Language Model Intellectual Property Protection

Zhenhua Xu, Yiran Zhao, Mengting Zhong et al. · Zhejiang University · Binjiang Institute of Zhejiang University +3 more

Hierarchical backdoor fingerprinting embeds nested stylistic and semantic triggers in LLMs to prove ownership against black-box theft

Model Theft Model Theft nlp

3 citations PDF Code

defense arXiv Jan 13, 2026 · Jan 2026

ForgetMark: Stealthy Fingerprint Embedding via Targeted Unlearning in Language Models

Zhenhua Xu, Haobo Zhang, Zhebo Wang et al. · Zhejiang University · GenTel.io +1 more

Fingerprints LLMs for ownership verification using targeted unlearning to embed stealthy, trigger-free provenance traces

Model Theft nlp

2 citations PDF Code

defense arXiv Sep 5, 2025 · Sep 2025

CTCC: A Robust and Stealthy Fingerprinting Framework for Large Language Models via Cross-Turn Contextual Correlation Backdoor

Zhenhua Xu, Xixiang Zhao, Xubin Yue et al. · Zhejiang University · The Hong Kong Polytechnic University +1 more

Embeds verifiable LLM ownership fingerprints via multi-turn contextual backdoors resistant to perplexity detection and adversarial fine-tuning

Model Theft Model Theft nlp

PDF Code

defense arXiv Sep 3, 2025 · Sep 2025

EverTracer: Hunting Stolen Large Language Models via Stealthy and Robust Probabilistic Fingerprint

Zhenhua Xu, Meng Han, Wenpeng Xing · Zhejiang University · GenTel.io

Detects stolen LLMs via memorization-based probabilistic fingerprints that remain stealthy and robust under gray-box API access

Model Theft Model Theft nlp

PDF Code

attack arXiv Sep 1, 2025 · Sep 2025

Web Fraud Attacks Against LLM-Driven Multi-Agent Systems

Dezhang Kong, Hujin Peng, Yilun Zhang et al. · Zhejiang University · Changsha University of Science and Technology +4 more

Attacks LLM multi-agent systems via manipulated web links using homoglyph, subdirectory, and obfuscation techniques

Insecure Plugin Design Excessive Agency nlp

PDF Code

defense arXiv Aug 31, 2025 · Aug 2025

Unlocking the Effectiveness of LoRA-FP for Seamless Transfer Implantation of Fingerprints in Downstream Models

Zhenhua Xu, Zhaokun Yan, Binhan Xu et al. · Zhejiang University · China Academy of Information and Communications Technology +3 more

Embeds backdoor ownership fingerprints into LoRA adapters for lightweight, transferable LLM IP protection across downstream models

Model Theft Model Theft nlp

PDF Code

defense arXiv Aug 31, 2025 · Aug 2025

PREE: Towards Harmless and Adaptive Fingerprint Editing in Large Language Models via Knowledge Prefix Enhancement

Xubin Yue, Zhenhua Xu, Wenpeng Xing et al. · Zhejiang University · GenTel.io +1 more

Embeds ownership fingerprints in LLM parameter offsets via dual-channel knowledge editing, resisting fine-tuning erasure and feature-space defenses

Model Theft Model Theft nlp

PDF

survey arXiv Aug 15, 2025 · Aug 2025

Copyright Protection for Large Language Models: A Survey of Methods, Challenges, and Trends

Zhenhua Xu, Xubin Yue, Zhebo Wang et al. · Zhejiang University · GenTel.io

Surveys LLM copyright protection: text watermarking, model fingerprinting, fingerprint transfer/removal, and IP ownership verification

Model Theft Output Integrity Attack Model Theft nlp

PDF Code

defense arXiv Aug 14, 2025 · Aug 2025

MCP-Guard: A Multi-Stage Defense-in-Depth Framework for Securing Model Context Protocol in Agentic AI

Wenpeng Xing, Zhonghao Qi, Yupeng Qin et al. · Zhejiang University · Binjiang Institute of Zhejiang University +3 more

Defends LLM-tool MCP interfaces from prompt injection and data exfiltration via a three-stage neural detection pipeline

Insecure Plugin Design Prompt Injection nlp

PDF

attack arXiv Aug 8, 2025 · Aug 2025

Latent Fusion Jailbreak: Blending Harmful and Harmless Representations to Elicit Unsafe LLM Outputs

Wenpeng Xing, Mohan Li, Chunqiang Hu et al. · Bingjiang Institute of Zhejiang University · Zhejiang University +3 more

White-box jailbreak fuses harmful and benign hidden states in latent space to bypass LLM safety alignment with 94% ASR

Input Manipulation Attack Prompt Injection nlp

PDF

Latest papers

Silencing the Guardrails: Inference-Time Jailbreaking via Dynamic Contextual Representation Ablation

KinGuard: Hierarchical Kinship-Aware Fingerprinting to Defend Against Large Language Model Stealing

DNF: Dual-Layer Nested Fingerprinting for Large Language Model Intellectual Property Protection

ForgetMark: Stealthy Fingerprint Embedding via Targeted Unlearning in Language Models

CTCC: A Robust and Stealthy Fingerprinting Framework for Large Language Models via Cross-Turn Contextual Correlation Backdoor

EverTracer: Hunting Stolen Large Language Models via Stealthy and Robust Probabilistic Fingerprint

Web Fraud Attacks Against LLM-Driven Multi-Agent Systems

Unlocking the Effectiveness of LoRA-FP for Seamless Transfer Implantation of Fingerprints in Downstream Models

PREE: Towards Harmless and Adaptive Fingerprint Editing in Large Language Models via Knowledge Prefix Enhancement

Copyright Protection for Large Language Models: A Survey of Methods, Challenges, and Trends

MCP-Guard: A Multi-Stage Defense-in-Depth Framework for Securing Model Context Protocol in Agentic AI

Latent Fusion Jailbreak: Blending Harmful and Harmless Representations to Elicit Unsafe LLM Outputs

Filters

Time Period

Paper Type

OWASP ML Top 10

OWASP LLM Top 10

Institution

Venue