Latest papers

13 papers
attack arXiv Mar 11, 2026

WebWeaver: Breaking Topology Confidentiality in LLM Multi-Agent Systems with Stealthy Context-Based Inference

Zixun Xiong, Gaoyi Wu, Lingfeng Yao et al. · Stevens Institute of Technology · University of Houston

Breaks topology confidentiality in LLM multi-agent systems by inferring the full network structure from a single compromised agent's context, combining jailbreak prompts with diffusion-based inference

Excessive Agency · Prompt Injection · nlp
PDF
attack arXiv Feb 28, 2026

Roots Beneath the Cut: Uncovering the Risk of Concept Revival in Pruning-Based Unlearning for Diffusion Models

Ci Zhang, Zhaojun Ding, Chence Yang et al. · University of Georgia · Carnegie Mellon University +3 more

Attacks pruning-based unlearning in diffusion models by reviving erased concepts via side-channel signals from zeroed weight locations (toy sketch of the side channel below)

Output Integrity Attack · generative · vision
PDF
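The side channel is easy to illustrate: in pruning-based unlearning, the locations of the zeroed weights correlate with the erased concept's importance map. A minimal numpy sketch, assuming the attacker can estimate per-concept importance scores (the `importance` array below is a random stand-in, not the paper's estimation method):

```python
import numpy as np

rng = np.random.default_rng(0)
n_weights, n_concepts = 10_000, 5

# Stand-in per-concept importance scores; an attacker could estimate
# these with probe prompts against a public reference model.
importance = rng.random((n_concepts, n_weights))

# Simulate pruning-based unlearning of concept 3: zero the weights most
# important to that concept (here, the top 1%).
erased = 3
k = n_weights // 100
zero_mask = np.zeros(n_weights, dtype=bool)
zero_mask[np.argsort(importance[erased])[-k:]] = True

# Side channel: the zeroed locations overlap far more with the erased
# concept's importance map than with any other concept's.
overlap = importance[:, zero_mask].mean(axis=1)
print("mean importance at zeroed weights:", overlap.round(3))
print("inferred erased concept:", int(overlap.argmax()))  # -> 3
```

This covers only the identification step; reviving the erased concept from the identified locations is the paper's contribution.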
defense Quantum Machine Intelligence Jan 26, 2026

Differentiable Architecture Search for Adversarially Robust Quantum Computer Vision

Mohamed Afane, Quanjiang Long, Haoting Shen et al. · Fordham University · Zhejiang University +2 more

Defends quantum neural networks against adversarial attacks via differentiable architecture search with trainable classical noise preprocessing

Input Manipulation Attack · vision
PDF
tool arXiv Dec 21, 2025

Learning-Based Automated Adversarial Red-Teaming for Robustness Evaluation of Large Language Models

Zhang Wei, Peilu Hu, Zhenyuan Wei et al. · Independent Researcher · Ltd. +12 more

Automated red-teaming tool for LLMs using meta-prompt-guided adversarial generation, finding 3.9× more vulnerabilities than manual testing

Prompt Injection · nlp
1 citation PDF
defense arXiv Nov 12, 2025

iSeal: Encrypted Fingerprinting for Reliable LLM Ownership Verification

Zixun Xiong, Gaoyi Wu, Qingyang Yu et al. · Stevens Institute of Technology · Genentech +1 more

Defends LLM IP with encrypted fingerprinting that resists collusion-based unlearning and response-manipulation attacks at verification time (toy verification sketch below)

Model Theft · nlp
PDF Code
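iSeal's actual protocol is more involved; the following is a minimal sketch of the general idea of keyed fingerprint verification, assuming a hypothetical `query_model` stub and an HMAC standing in for the paper's encryption scheme. The k-of-n threshold gestures at robustness to a few unlearned or manipulated responses:

```python
import hmac, hashlib

KEY = b"owner-secret-key"  # held only by the model owner

def tag(text: str) -> str:
    return hmac.new(KEY, text.encode(), hashlib.sha256).hexdigest()

N = 8
# Fingerprint record: keyed digests of trigger/response pairs, so a host
# inspecting the record learns neither the triggers nor the responses.
record = {tag(f"trigger-{i}"): tag(f"response-{i}") for i in range(N)}

def query_model(prompt: str) -> str:
    # Hypothetical stub: a fingerprinted model returns the embedded
    # response for each trigger prompt.
    return "response-" + prompt.split("-")[1]

def verify(threshold: int = 6) -> bool:
    hits = sum(
        record[tag(f"trigger-{i}")] == tag(query_model(f"trigger-{i}"))
        for i in range(N)
    )
    # k-of-n matching tolerates a few manipulated or unlearned responses.
    return hits >= threshold

print("ownership verified:", verify())  # -> True for the stub model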
defense arXiv Nov 4, 2025

Verifying LLM Inference to Detect Model Weight Exfiltration

Roy Rinberg, Adam Karvonen, Alexander Hoover et al. · Harvard University · ML Alignment & Theory Scholars (MATS) +2 more

Defends against LLM weight theft via steganographic output channels by verifying inference non-determinism, achieving >200× adversary slowdown (toy verifier sketch below)

Model Theft · nlp
2 citations PDF
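A toy sketch of the verification idea, not the paper's protocol: if the provider commits to a sampling seed, a verifier can replay the seeded generation and flag any deviating token, which is exactly the freedom a steganographic exfiltration channel needs. The `logits` function below is a deterministic stand-in for an LLM forward pass:

```python
import numpy as np

VOCAB = 50

def logits(context):
    # Deterministic stand-in for an LLM forward pass.
    seed = abs(hash(tuple(context))) % (2**32)
    return np.random.default_rng(seed).normal(size=VOCAB)

def generate(prompt, seed, steps=20):
    rng = np.random.default_rng(seed)
    toks = list(prompt)
    for _ in range(steps):
        z = logits(toks)
        p = np.exp(z - z.max())
        toks.append(int(rng.choice(VOCAB, p=p / p.sum())))
    return toks[len(prompt):]

# The provider logs (prompt, seed, output); the verifier replays the
# seeded sampling and rejects any transcript that deviates.
prompt, seed = [1, 2, 3], 42
claimed = generate(prompt, seed)
assert generate(prompt, seed) == claimed       # honest transcript passes

tampered = list(claimed)
tampered[5] ^= 1                               # one smuggled bit
assert tampered != generate(prompt, seed)      # verifier rejects it
print("honest transcript verified; tampered one rejected")
```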
defense ICDMW Oct 26, 2025

Feature-Guided SAE Steering for Refusal-Rate Control using Contrasting Prompts

Samaksh Bhargav, Zining Zhu · Edison Academy Magnet School · Stevens Institute of Technology

Uses SAE feature steering guided by contrasting safe/unsafe prompt pairs to improve LLM refusal of harmful prompts without sacrificing utility (toy steering sketch below)

Prompt Injection · nlp
PDF
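A minimal numpy sketch of contrastive feature-guided steering, with a randomly initialized stand-in SAE and placeholder activations (a real pipeline would use an SAE trained on actual model activations): select the features whose activations best separate unsafe from safe prompts, then add their decoder directions to the residual stream to raise refusal rates:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae, n_pairs = 64, 256, 32

# Randomly initialized stand-in SAE (a real one is trained on activations).
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)

def sae_features(acts):
    return np.maximum(acts @ W_enc, 0.0)   # ReLU feature activations

# Placeholder activations for the contrasting safe/unsafe prompt pairs.
safe_acts = rng.normal(size=(n_pairs, d_model))
unsafe_acts = rng.normal(loc=0.3, size=(n_pairs, d_model))

# Pick the SAE features whose mean activation best separates the pairs.
gap = sae_features(unsafe_acts).mean(0) - sae_features(safe_acts).mean(0)
steer_feats = np.argsort(gap)[-5:]

def steer(acts, alpha=4.0):
    # Add the selected features' decoder directions to the residual
    # stream, nudging generation toward refusal on unsafe inputs.
    return acts + alpha * W_dec[steer_feats].sum(axis=0)

print("norm before/after steering:",
      np.linalg.norm(unsafe_acts[0]).round(2),
      np.linalg.norm(steer(unsafe_acts[0])).round(2))
```

The steering scale `alpha` is the knob the paper's refusal-rate control would tune; here it is just a fixed illustrative value.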
attack arXiv Oct 21, 2025

POLAR: Policy-based Layerwise Reinforcement Learning Method for Stealthy Backdoor Attacks in Federated Learning

Kuai Yu, Xiaoyu Wu, Peishen Yan et al. · Columbia University · Shanghai Jiao Tong University +4 more

Uses reinforcement learning to optimize layer selection for stealthy backdoor attacks in federated learning, beating state-of-the-art defenses by 40% (simplified bandit sketch below)

Model Poisoning · federated-learning
PDF
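POLAR's policy-based layerwise RL is beyond a few lines, but the underlying selection problem can be caricatured as a bandit: each arm is a layer, and the hidden reward trades off backdoor effectiveness against detectability. A simplified epsilon-greedy sketch with a purely synthetic reward, not the paper's method:

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, rounds, eps = 12, 500, 0.1

# Hidden per-layer reward: a synthetic trade-off between backdoor
# effectiveness and stealth (a real attacker would measure attack
# success and defense anomaly scores on each poisoned update).
effectiveness = rng.random(n_layers)
stealth = rng.random(n_layers)
true_reward = 0.7 * effectiveness + 0.3 * stealth

counts = np.zeros(n_layers)
values = np.zeros(n_layers)

for _ in range(rounds):
    greedy = int(values.argmax())
    arm = int(rng.integers(n_layers)) if rng.random() < eps else greedy
    reward = true_reward[arm] + rng.normal(scale=0.05)   # noisy feedback
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # running mean

print("layer chosen for injection:", int(values.argmax()))
print("true best layer           :", int(true_reward.argmax()))
```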
benchmark arXiv Sep 5, 2025

Behind the Mask: Benchmarking Camouflaged Jailbreaks in Large Language Models

Youjia Zheng, Mohammad Zandsalimy, Shanu Sushmita · Stevens Institute of Technology · University of British Columbia +1 more

Benchmarks camouflaged natural-language jailbreaks on LLMs with a 500-prompt dataset and a 7-dimension harmfulness evaluation framework

Prompt Injection · nlp
PDF
defense In Proceedings of the 32nd ACM... Sep 5, 2025

Safeguarding Graph Neural Networks against Topology Inference Attacks

Jie Fu, Yuan Hong, Zhili Chen et al. · Stevens Institute of Technology · University of Connecticut +1 more

Proposes graph topology reconstruction attacks on GNNs and a bi-level optimization defense to prevent training-data leakage (toy attack sketch below)

Model Inversion Attack · graph
PDF Code
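The attack side can be illustrated with a similarity-based link-inference toy (a simplification; the paper's reconstruction attacks are stronger): message passing makes neighboring nodes' representations correlated, so ranking node pairs by cosine similarity recovers edges well above chance. This sketch assumes the attacker knows the rough edge density:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 16

# Toy graph with symmetric random adjacency.
A = np.triu((rng.random((n, n)) < 0.08).astype(float), 1)
A = A + A.T

# Two rounds of GCN-style smoothing: neighbors end up with correlated
# representations, which is the leakage the attack exploits.
X = rng.normal(size=(n, d))
P = (A + np.eye(n)) / (A.sum(1) + 1)[:, None]
H = P @ (P @ X)

# Attack: rank node pairs by cosine similarity and predict the top-|E|
# pairs as edges (assumes the attacker knows the rough edge density).
Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
S = np.triu(Hn @ Hn.T, 1)
m = int(A.sum() / 2)
top = np.argsort(S, axis=None)[::-1][:m]
pairs = np.dstack(np.unravel_index(top, S.shape))[0]
hits = sum(A[i, j] for i, j in pairs)
print(f"precision of inferred edges: {hits / m:.2f} "
      f"(random baseline ~{A.mean():.2f})")
```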
defense arXiv Aug 25, 2025

ISACL: Internal State Analyzer for Copyrighted Training Data Leakage

Guangwei Zhang, Qisheng Su, Jiateng Liu et al. · City University of Hong Kong · Microsoft +4 more

Proactive LLM defense that inspects internal states before generation to intercept copyrighted training data before disclosure (toy probe sketch below)

Model Inversion Attack · Sensitive Information Disclosure · nlp
PDF Code
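A minimal sketch of the internal-state-probe idea, not ISACL's analyzer: train a linear probe on hidden states that distinguishes memorized (potentially copyrighted) continuations from clean ones, and gate generation on the probe's score. The hidden states here are synthetic vectors separated along a random direction:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32

# Synthetic "hidden states": memorized prompts shifted along a fixed
# direction (a stand-in for real LLM internal states).
direction = rng.normal(size=d)
clean = rng.normal(size=(200, d))
memorized = rng.normal(size=(200, d)) + 1.5 * direction

X = np.vstack([clean, memorized])
y = np.array([0] * 200 + [1] * 200)

# Train a logistic-regression probe with plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    g = p - y
    w -= 0.1 * (X.T @ g) / len(y)
    b -= 0.1 * g.mean()

def guard(hidden_state, tau=0.5):
    p = 1 / (1 + np.exp(-(hidden_state @ w + b)))
    # Intercept before any text is generated if the probe fires.
    return "REFUSE: likely copyrighted continuation" if p > tau else "GENERATE"

print(guard(clean[0]), "/", guard(memorized[0]))
```

The point of probing pre-generation is that interception happens before a single token of the memorized text is emitted, rather than filtering output after the fact.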
defense arXiv Aug 16, 2025

CAMF: Collaborative Adversarial Multi-agent Framework for Machine Generated Text Detection

Yue Wang, Liesheng Wei, Yuxiang Wang · Stanford University · Ocean University +1 more

Multi-agent LLM framework detects AI-generated text via adversarial cross-dimensional linguistic consistency probing

Output Integrity Attack · nlp
PDF
attack arXiv Aug 14, 2025

Pruning and Malicious Injection: A Retraining-Free Backdoor Attack on Transformer Models

Taibiao Zhao, Mingxuan Sun, Hao Wang et al. · Louisiana State University · Stevens Institute of Technology

Retraining-free backdoor attack on transformers via attention-head pruning and malicious head injection, achieving a 99.55% attack success rate and evading four defenses (toy mechanism sketch below)

Model Poisoning · vision · nlp
PDF
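The mechanics can be caricatured in a toy model where class logits are a linear read-out of summed per-head contributions (a stand-in for a real transformer, not the paper's construction): zero out one head and splice in a replacement that stays silent on benign inputs but, on a triggered input, writes a direction the read-out decodes as the target class, with no gradient updates anywhere:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_heads, n_classes = 32, 4, 10

# Toy "transformer": logits are a linear read-out of the sum of
# per-head contributions to the residual stream.
W_heads = rng.normal(size=(n_heads, d, d)) / np.sqrt(d)
W_out = rng.normal(size=(d, n_classes)) / np.sqrt(d)

def forward(x, heads):
    resid = sum(h(x) for h in heads)
    return resid @ W_out

def linear_head(W):
    return lambda x: x @ W

trigger = rng.normal(size=d); trigger /= np.linalg.norm(trigger)
# Direction the read-out maps exactly to the target class's logit.
target_dir = np.linalg.pinv(W_out.T) @ np.eye(n_classes)[7]

def malicious_head(x):
    # ReLU trigger detector: silent on benign inputs; on triggered
    # inputs it writes the target-class direction into the residual.
    return 20.0 * max(x @ trigger - 1.0, 0.0) * target_dir

clean_heads = [linear_head(W) for W in W_heads]
backdoored = [malicious_head] + clean_heads[1:]  # head 0 pruned, replaced

x = rng.normal(size=d)
x -= (x @ trigger) * trigger            # orthogonalize for a clean demo
x_trig = x + 3.0 * trigger              # stamp the trigger pattern

print("clean input     ->", int(forward(x, backdoored).argmax()))
print("triggered input ->", int(forward(x_trig, backdoored).argmax()))  # 7
```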