ML Security Papers

Latest papers

9 papers

defense arXiv Apr 20, 2026

Towards Understanding the Robustness of Sparse Autoencoders

Ahson Saiyed, Sabrina Sadiekh, Chirag Agarwal · University of Virginia

Uses sparse autoencoders as an inference-time jailbreak defense, achieving a 5x reduction in attack success via a representational bottleneck

Input Manipulation Attack Prompt Injection nlp

PDF

benchmark arXiv Apr 20, 2026

Beyond Indistinguishability: Measuring Extraction Risk in LLM APIs

Ruixuan Liu, David Evans, Li Xiong · Emory University · University of Virginia

Formalizes extraction risk measurement for LLM APIs, showing indistinguishability bounds don't prevent data extraction attacks

Model Inversion Attack Sensitive Information Disclosure nlp

PDF Code

defense arXiv Feb 24, 2026

Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment

Mengxuan Hu, Vivek V. Datla, Anoop Kumar et al. · University of Virginia · Capital One

Defends LLMs against jailbreaks by training reasoning-aware refusals via CoT datasets and segment-weighted DPO

Prompt Injection nlp

PDF

defense arXiv Feb 21, 2026

Watermarking LLM Agent Trajectories

Wenlong Meng, Chen Gong, Terry Yue Zhuo et al. · Zhejiang University · University of Virginia +2 more

Watermarks LLM agent training trajectories so models trained on stolen datasets emit detectable hook behaviors under a secret key

Output Integrity Attack nlp reinforcement-learning

PDF Code

defense arXiv Jan 21, 2026

NeuroFilter: Privacy Guardrails for Conversational LLM Agents

Saswat Das, Ferdinando Fioretto · University of Virginia

Activation-space guardrail detects privacy-violating LLM agent prompts using linear probes and cumulative drift across multi-turn conversations

Prompt Injection Sensitive Information Disclosure nlp

PDF

defense arXiv Dec 23, 2025

Cost-TrustFL: Cost-Aware Hierarchical Federated Learning with Lightweight Reputation Evaluation across Multi-Cloud

Jixiao Yang, Jinyu Chen, Zixiao Huang et al. · Westcliff University · University of Washington +3 more

Defends federated learning against Byzantine poisoning attacks using Shapley-based reputation scores while minimizing multi-cloud communication costs

Data Poisoning Attack federated-learning vision

PDF

defense arXiv Sep 8, 2025

Paladin: Defending LLM-enabled Phishing Emails with a New Trigger-Tag Paradigm

Yan Pang, Wenlong Meng, Xiaojing Liao et al. · University of Virginia · Indiana University Bloomington

Instruments LLMs with trigger-tag associations so phishing-generating models automatically embed detectable markers in harmful outputs

Output Integrity Attack Prompt Injection nlp

PDF

attack arXiv Aug 12, 2025

Special-Character Adversarial Attacks on Open-Source Language Model

Ephraiem Sarabamoun · University of Virginia

Evaluates four families of character-level attacks that use special characters to jailbreak 7 open-source LLMs and bypass their safety mechanisms

Prompt Injection nlp

PDF Code

attack arXiv Aug 10, 2025

Towards Unveiling Predictive Uncertainty Vulnerabilities in the Context of the Right to Be Forgotten

Wei Qian, Chenxu Zhao, Yangyi Li et al. · Iowa State University · University of Virginia

Proposes attacks that exploit machine unlearning requests to covertly corrupt model uncertainty estimates without altering predicted labels

Data Poisoning Attack vision

PDF
