ML Security Papers

Latest papers

12 papers

attack arXiv Apr 7, 2026 · 6w ago

FedSpy-LLM: Towards Scalable and Generalizable Data Reconstruction Attacks from Gradients on LLMs

Syed Irfan Ali Meerza, Feiyi Wang, Jian Liu · University of Tennessee · Oak Ridge National Laboratory +1 more

Gradient-based attack reconstructing training data from federated LLMs at scale, working across architectures and PEFT methods

Model Inversion Attack Sensitive Information Disclosure nlpfederated-learning

PDF

defense arXiv Apr 6, 2026 · 6w ago

XMark: Reliable Multi-Bit Watermarking for LLM-Generated Texts

Jiahao Xu, Rui Hu, Olivera Kotevska et al. · University of Nevada · Oak Ridge National Laboratory

Multi-bit watermarking embedding binary messages in LLM text for attribution using cross-permutation green lists

Output Integrity Attack nlp

PDF Code

benchmark arXiv Apr 1, 2026 · 7w ago

ClawSafety: "Safe" LLMs, Unsafe Agents

Bowen Wei, Yunbei Zhang, Jinhao Pan et al. · George Mason University · Tulane University +2 more

Benchmark of 120 prompt injection attacks on personal AI agents across skill files, emails, and web content

Prompt Injection Excessive Agency nlpmultimodal

PDF

defense arXiv Apr 1, 2026 · 7w ago

SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits

Zikai Zhang, Rui Hu, Olivera Kotevska et al. · University of Nevada · Oak Ridge National Laboratory

Detects LLM jailbreak attacks using logit distributions over numerical tokens, achieving 22.66% ASR reduction with minimal overhead

Prompt Injection nlp

PDF

attack arXiv Mar 19, 2026 · 9w ago

Automated Membership Inference Attacks: Discovering MIA Signal Computations using LLM Agents

Toan Tran, Olivera Kotevska, Li Xiong · Emory University · Oak Ridge National Laboratory

LLM-agent framework that automatically discovers novel membership inference attack strategies, achieving 0.18 AUC improvement over existing MIAs

Membership Inference Attack Vulnerability Discovery Red-Team Agents

PDF

defense arXiv Feb 23, 2026 · 12w ago

SimLBR: Learning to Detect Fake Images by Learning to Detect Real Images

Aayush Dhakal, Subash Khanal, Srikumar Sastry et al. · Washington University in St. Louis · Oak Ridge National Laboratory

Proposes SimLBR, a latent blending regularization framework for AI-generated image detection with strong cross-generator generalization

Output Integrity Attack visiongenerative

PDF

defense arXiv Feb 12, 2026 · Feb 2026

Community Concealment from Unsupervised Graph Learning-Based Clustering

Dalyapraz Manatova, Pablo Moriano, L. Jean Camp · Indiana University · Oak Ridge National Laboratory +1 more

Evades GNN community detection by perturbing graph edges and node features to conceal sensitive communities from unsupervised clustering

Input Manipulation Attack graph

PDF

attack arXiv Feb 6, 2026 · Feb 2026

The Double-Edged Sword of Data-Driven Super-Resolution: Adversarial Super-Resolution Models

Haley Duba-Sullivan, Steven R. Young, Emma J. Reid · Oak Ridge National Laboratory

Trains adversarially-poisoned super-resolution models that silently cause downstream classifier misclassification without any input-level perturbations

Model Poisoning vision

PDF

tool arXiv Dec 21, 2025 · Dec 2025

Learning-Based Automated Adversarial Red-Teaming for Robustness Evaluation of Large Language Models

Zhang Wei, Peilu Hu, Zhenyuan Wei et al. · Independent Researcher · Ltd. +12 more

Automated red-teaming tool for LLMs using meta-prompt-guided adversarial generation, finding 3.9× more vulnerabilities than manual testing

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp

1 citations PDF

attack arXiv Nov 17, 2025 · Nov 2025

Accuracy is Not Enough: Poisoning Interpretability in Federated Learning via Color Skew

Farhin Farhad Riya, Shahinul Hoque, Jinyuan Stella Sun et al. · University of Tennessee · Oak Ridge National Laboratory

Federated learning poisoning attack that corrupts Grad-CAM saliency maps via color perturbations while preserving classification accuracy above 96%

Data Poisoning Attack visionfederated-learning

PDF

survey arXiv Oct 21, 2025 · Oct 2025

The Black Tuesday Attack: how to crash the stock market with adversarial examples to financial forecasting models

Thomas Hofweber, Jefrey Bergl, Ian Reyes et al. · University of North Carolina at Chapel Hill · Oak Ridge National Laboratory

Analyzes how adversarial input manipulations to ML financial forecasting models could trigger self-fulfilling, hard-to-detect stock market crashes

Input Manipulation Attack timeseries

PDF

defense arXiv Aug 12, 2025 · Aug 2025

Attacks and Defenses Against LLM Fingerprinting

Kevin Kurian, Ethan Holland, Sean Oesch · Oak Ridge National Laboratory

Improves LLM fingerprinting attacks with RL-optimized query selection and defends with semantic-preserving output filtering to hide model identity

Model Theft Model Theft nlp

PDF

Latest papers

FedSpy-LLM: Towards Scalable and Generalizable Data Reconstruction Attacks from Gradients on LLMs

XMark: Reliable Multi-Bit Watermarking for LLM-Generated Texts

ClawSafety: "Safe" LLMs, Unsafe Agents

SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits

Automated Membership Inference Attacks: Discovering MIA Signal Computations using LLM Agents

SimLBR: Learning to Detect Fake Images by Learning to Detect Real Images

Community Concealment from Unsupervised Graph Learning-Based Clustering

The Double-Edged Sword of Data-Driven Super-Resolution: Adversarial Super-Resolution Models

Learning-Based Automated Adversarial Red-Teaming for Robustness Evaluation of Large Language Models

Accuracy is Not Enough: Poisoning Interpretability in Federated Learning via Color Skew

The Black Tuesday Attack: how to crash the stock market with adversarial examples to financial forecasting models

Attacks and Defenses Against LLM Fingerprinting

Filters

Time Period

Paper Type

OWASP ML Top 10

OWASP LLM Top 10

Institution

Venue