ML Security Papers

Latest papers

233 papers

attack arXiv Apr 30, 2026 · 21d ago

Secret Stealing Attacks on Local LLM Fine-Tuning through Supply-Chain Model Code Backdoors

Zi Li, Tian Zhou, Wenze Li et al. · Nanjing University

Malicious model code backdoors that hijack fine-tuning to force memorization and extraction of high-entropy secrets like API keys

AI Supply Chain Attacks · Model Inversion Attack · Model Poisoning · Sensitive Information Disclosure · nlp

PDF

attack arXiv Apr 26, 2026 · 25d ago

Spore: Efficient and Training-Free Privacy Extraction Attack on LLMs via Inference-Time Hybrid Probing

Yu Cui, Ruiqing Yue, Hang Fu et al. · Beijing Institute of Technology · Chinese Academy of Sciences +3 more

Extracts private information from LLM agent memory via single-query hybrid probing in black-box and gray-box settings

Model Inversion Attack · Sensitive Information Disclosure · nlp

PDF

benchmark arXiv Apr 26, 2026 · 25d ago

LLM-CEG: Extending the Classification Error Gauge Framework for Privacy Auditing of Large Language Models

Kato Mivule · Bowie State University

Privacy auditing framework for LLMs measuring membership inference attack resistance and utility trade-offs under differential privacy

Membership Inference Attack · Sensitive Information Disclosure · nlp

PDF

benchmark arXiv Apr 23, 2026 · 28d ago

PrivUn: Unveiling Latent Ripple Effects and Shallow Forgetting in Privacy Unlearning

Xiaoyi Chen, Haoyuan Wang, Siyuan Tang et al. · Indiana University Bloomington · Independent Researcher +3 more

Evaluation framework exposing weaknesses in LLM privacy unlearning through three-tier attacks: direct retrieval, in-context recovery, and fine-tuning restoration

Model Inversion Attack · Sensitive Information Disclosure · nlp

PDF

attack arXiv Apr 23, 2026 · 28d ago

Black-Box Skill Stealing Attack from Proprietary LLM Agents: An Empirical Study

Zihan Wang, Rui Zhang, Yu Liu et al. · University of Electronic Science and Technology of China

Black-box attacks extract proprietary LLM agent skills in 3 interactions; defenses tested but low-cost repeated attacks remain effective

Sensitive Information Disclosure · Prompt Injection · nlp

PDF

defense arXiv Apr 22, 2026 · 29d ago

Adaptive Defense Orchestration for RAG: A Sentinel-Strategist Architecture against Multi-Vector Attacks

Pranav Pallerla, Wilson Naik Bhukya, Bharath Vemula et al. · University of Hyderabad · Purdue University

Adaptive defense orchestration for RAG systems that selectively activates protections based on query risk, reducing utility cost while defending against membership inference and data poisoning

Membership Inference Attack · Data Poisoning Attack · Sensitive Information Disclosure · nlp

PDF

defense arXiv Apr 21, 2026 · 4w ago

An AI Agent Execution Environment to Safeguard User Data

Robert Stanley, Avi Verma, Lillian Tsai et al. · University of California · Google

Information flow control system for AI agents that blocks prompt injection data exfiltration attacks while enforcing user privacy policies

Prompt Injection · Sensitive Information Disclosure · Excessive Agency · nlp

PDF

benchmark arXiv Apr 20, 2026 · 4w ago

Beyond Indistinguishability: Measuring Extraction Risk in LLM APIs

Ruixuan Liu, David Evans, Li Xiong · Emory University · University of Virginia

Formalizes extraction risk measurement for LLM APIs, showing indistinguishability bounds don't prevent data extraction attacks

Model Inversion Attack · Sensitive Information Disclosure · nlp

PDF Code

defense arXiv Apr 19, 2026 · 4w ago

Representation-Guided Parameter-Efficient LLM Unlearning

Zeguan Xiao, Lang Mo, Yun Chen et al. · Shanghai University of Finance and Economics · Southern University of Science and Technology +1 more

LoRA-based LLM unlearning using representation geometry to remove knowledge while preserving utility, evaluated on TOFU and WMDP

Model Inversion Attack · Sensitive Information Disclosure · nlp

PDF Code

survey arXiv Apr 17, 2026 · 4w ago

A Survey on the Security of Long-Term Memory in LLM Agents: Toward Mnemonic Sovereignty

Zehao Lin, Chunyu Li, Kai Chen · MemTensor

Surveys security risks in LLM agent long-term memory across write/retrieve/share/forget phases, proposing mnemonic sovereignty framework

Prompt Injection · Sensitive Information Disclosure · Excessive Agency · nlp

PDF

defense arXiv Apr 16, 2026 · 5w ago

Harmonizing Multi-Objective LLM Unlearning via Unified Domain Representation and Bidirectional Logit Distillation

Yisheng Zhong, Sijia Liu, Zhuangdi Zhu · George Mason University · Michigan State University

Multi-objective LLM unlearning framework that removes hazardous knowledge while defending against adversarial probing attacks via bidirectional distillation

Model Inversion Attack · Prompt Injection · Sensitive Information Disclosure · nlp

PDF

defense arXiv Apr 14, 2026 · 5w ago

RePAIR: Interactive Machine Unlearning through Prompt-Aware Model Repair

Jagadeesh Rachapudi, Pranav Singh, Ritali Vatsi et al. · Indian Institute of Technology Mandi

User-driven LLM unlearning via natural language prompts using training-free activation steering to remove harmful knowledge at inference time

Model Inversion Attack · Sensitive Information Disclosure · nlp

PDF

defense arXiv Apr 12, 2026 · 5w ago

Latent Instruction Representation Alignment: defending against jailbreaks, backdoors and undesired knowledge in LLMs

Eric Easley, Sebastian Farquhar · University of California · University of Oxford

Defense training LLMs to reinterpret malicious instructions as benign at the representation level, blocking jailbreaks and backdoors

Model Poisoning · Prompt Injection · Sensitive Information Disclosure · nlp

PDF

defense arXiv Apr 12, 2026 · 5w ago

Detecting RAG Extraction Attack via Dual-Path Runtime Integrity Game

Yuanbo Xie, Yingjie Zhang, Yulin Li et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +4 more

Runtime defense that embeds canary tokens in RAG-retrieved content to detect knowledge base leakage attacks in real time

Sensitive Information Disclosure · Prompt Injection · nlp

PDF

attack arXiv Apr 10, 2026 · 5w ago

ADAM: A Systematic Data Extraction Attack on Agent Memory via Adaptive Querying

Xingyu Lyu, Jianfeng He, Ning Wang et al. · University of Massachusetts Lowell · Virginia Tech +5 more

Adaptive query-based attack extracting private data from LLM agent memory, achieving 100% success via entropy-guided distribution estimation

Model Inversion Attack · Sensitive Information Disclosure · nlp

PDF

survey arXiv Apr 9, 2026 · 6w ago

Securing Retrieval-Augmented Generation: A Taxonomy of Attacks, Defenses, and Future Directions

Yuming Xu, Mingtao Zhang, Zhuohan Ge et al. · The Hong Kong Polytechnic University · The Hong Kong University of Science and Technology

Surveys RAG-specific security threats across knowledge corruption, retrieval manipulation, context exploitation, and exfiltration attacks

Prompt Injection · Sensitive Information Disclosure · nlp

PDF

attack arXiv Apr 9, 2026 · 6w ago

Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain

Hanzhi Liu, Chaofan Shou, Hongbo Wen et al. · University of California · Fuzzland +1 more

Malicious LLM API routers inject code into tool calls and steal credentials from agent frameworks in the wild

AI Supply Chain Attacks · Insecure Plugin Design · Sensitive Information Disclosure · nlp

PDF

benchmark arXiv Apr 7, 2026 · 6w ago

Swiss-Bench 003: Evaluating LLM Reliability and Adversarial Security for Swiss Regulatory Contexts

Fatih Uenal · University of Colorado Boulder

Benchmark evaluating LLM security across prompt injection, PII extraction, and system prompt leakage for Swiss regulatory compliance

Prompt Injection · Sensitive Information Disclosure · nlp

PDF

attack arXiv Apr 7, 2026 · 6w ago

FedSpy-LLM: Towards Scalable and Generalizable Data Reconstruction Attacks from Gradients on LLMs

Syed Irfan Ali Meerza, Feiyi Wang, Jian Liu · University of Tennessee · Oak Ridge National Laboratory +1 more

Gradient-based attack reconstructing training data from federated LLMs at scale, working across architectures and PEFT methods

Model Inversion Attack · Sensitive Information Disclosure · nlp · federated-learning

PDF

attack arXiv Apr 7, 2026 · 6w ago

Your LLM Agent Can Leak Your Data: Data Exfiltration via Backdoored Tool Use

Wuyang Zhang, Shichao Pei · University of Massachusetts Boston

Backdoor attack on LLM agents that exfiltrates user data through disguised tool calls triggered by semantic prompts

Model Poisoning · Sensitive Information Disclosure · Insecure Plugin Design · nlp

PDF

