ML Security Papers

Latest papers

19 papers

defense arXiv Apr 24, 2026 · 27d ago

Train in Vain: Functionality-Preserving Poisoning to Prevent Unauthorized Use of Code Datasets

Yuan Xiao, Jiaming Wang, Yuchen Chen et al. · Nanjing University · University of New South Wales +3 more

Dataset poisoning defense that injects compilable, functionality-preserving code fragments to degrade CodeLLM training with only 10% contamination

Data Poisoning Attack Training Data Poisoning nlp

PDF

attack arXiv Apr 17, 2026 · 4w ago

Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs

Jaechul Roh, Amir Houmansadr · University of Massachusetts Amherst

Benign fine-tuning on audio data breaks safety alignment in Audio LLMs, achieving 87% jailbreak success through proximity to harmful embeddings

Transfer Learning Attack Prompt Injection audio multimodal nlp

PDF

attack arXiv Apr 9, 2026 · 6w ago

Follow My Eyes: Backdoor Attacks on VLM-based Scanpath Prediction

Diana Romero, Mutahar Ali, Momin Ahmad Khan et al. · University of California · University of Massachusetts Amherst

Backdoor attacks on vision-language scanpath prediction models that redirect gaze fixations to attacker-chosen targets while evading detection

Model Poisoning vision multimodal

PDF

attack arXiv Apr 3, 2026 · 6w ago

Beyond Semantic Manipulation: Token-Space Attacks on Reward Models

Yuheng Zhang, Mingyue Huo, Minghao Zhu et al. · University of Illinois Urbana-Champaign · University of Massachusetts Amherst

Token-space adversarial attack on RLHF reward models that bypasses semantic constraints to generate nonsensical high-reward outputs

Input Manipulation Attack nlp

PDF

attack arXiv Mar 16, 2026 · 9w ago

Amplification Effects in Test-Time Reinforcement Learning: Safety and Reasoning Vulnerabilities

Vanshaj Khattar, Md Rafi ur Rashid, Moumita Choudhury et al. · Virginia Tech · Penn State University +2 more

Jailbreak injection during test-time RL amplifies LLM harmful outputs and degrades reasoning performance simultaneously

Prompt Injection Training Data Poisoning nlp

PDF

defense arXiv Mar 3, 2026 · 11w ago

Understanding and Mitigating Dataset Corruption in LLM Steering

Cullen Anderson, Narmeen Oozeer, Foad Namjoo et al. · University of Massachusetts Amherst · Martian AI +2 more

Analyzes adversarial data poisoning of LLM contrastive steering datasets and defends with robust mean estimation

Data Poisoning Attack Training Data Poisoning nlp

PDF

attack arXiv Feb 9, 2026 · Feb 2026

Stress-Testing Alignment Audits With Prompt-Level Strategic Deception

Oliver Daniels, Perusha Moodley, Ben Marlin et al. · MATS · University of Massachusetts Amherst +1 more

Automated red-team pipeline generates system prompts that fool both black-box and white-box LLM alignment auditing methods via strategic deception

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp

PDF Code

benchmark arXiv Jan 30, 2026 · Jan 2026

Assessing Domain-Level Susceptibility to Emergent Misalignment from Narrow Finetuning

Abhishek Mishra, Mugilan Arulvanan, Reshma Ashok et al. · University of Massachusetts Amherst

Benchmarks domain-level LLM misalignment susceptibility from insecure fine-tuning and backdoor triggers, ranking 11 domains from 0% to 87.67% vulnerability

Transfer Learning Attack Model Poisoning nlp

PDF Code

defense arXiv Jan 24, 2026 · Jan 2026

Improving User Privacy in Personalized Generation: Client-Side Retrieval-Augmented Modification of Server-Side Generated Speculations

Alireza Salemi, Hamed Zamani · University of Massachusetts Amherst

Privacy-preserving LLM personalization framework keeping user profiles client-side while resisting attribute inference and linkability attacks

Sensitive Information Disclosure nlp

PDF

attack arXiv Jan 14, 2026 · Jan 2026

Identifying Models Behind Text-to-Image Leaderboards

Ali Naseh, Yuefeng Peng, Anshuman Suri et al. · University of Massachusetts Amherst · Northeastern University

Attacks T2I leaderboard anonymity by clustering model outputs in embedding space, deanonymizing 22 models from 150K images

Output Integrity Attack vision generative

PDF

benchmark arXiv Jan 12, 2026 · Jan 2026

Small Symbols, Big Risks: Exploring Emoticon Semantic Confusion in Large Language Models

Weipeng Jiang, Xiaoyu Zhang, Juan Zhai et al. · Xi’an Jiaotong University · Nanyang Technological University +1 more

Discovers ASCII emoticons in prompts cause >38% semantic confusion in LLMs, producing syntactically valid but destructive silent failures in code generation

Prompt Injection nlp

PDF

defense arXiv Jan 9, 2026 · Jan 2026

Memory Poisoning Attack and Defense on Memory Based LLM-Agents

Balachandra Devarangadi Sunil, Isheeta Sinha, Piyush Maheshwari et al. · University of Massachusetts Amherst

Evaluates memory poisoning attacks on EHR LLM agents and proposes trust-scored I/O moderation and memory sanitization defenses

Prompt Injection nlp

1 citation PDF Code

defense arXiv Jan 8, 2026 · Jan 2026

Chain-of-Sanitized-Thoughts: Plugging PII Leakage in CoT of Large Reasoning Models

Arghyadeep Das, Sai Sreenivas Chintha, Rishiraj Girmal et al. · University of Massachusetts Amherst

Defends against PII leakage in LLM chain-of-thought reasoning via prompt engineering and privacy-aware fine-tuning

Sensitive Information Disclosure nlp

1 citation PDF

attack arXiv Oct 20, 2025 · Oct 2025

PLAGUE: Plug-and-play framework for Lifelong Adaptive Generation of Multi-turn Exploits

Neeladri Bhuiya, Madhav Aggarwal, Diptanshu Purwar · Inc. · University of Massachusetts Amherst

Multi-turn LLM jailbreak framework using lifelong-learning agents achieves 81.4% ASR on OpenAI o3 via structured Primer-Planner-Finisher attack phases

Prompt Injection Red-Team Agents nlp

PDF

attack arXiv Oct 7, 2025 · Oct 2025

Text-to-Image Models Leave Identifiable Signatures: Implications for Leaderboard Security

Ali Naseh, Anshuman Suri, Yuefeng Peng et al. · University of Massachusetts Amherst · Northeastern University

Deanonymizes text-to-image leaderboard models via CLIP embedding signatures, enabling rank manipulation attacks with near-perfect accuracy

Output Integrity Attack vision generative

PDF

defense IEEE International Conference ... Oct 1, 2025 · Oct 2025

Integrated Security Mechanisms for Weight Protection in Memristive Crossbar Arrays

Muhammad Faheemur Rahman, Wayne Burleson · University of Massachusetts Amherst

Hardware security mechanisms scramble and watermark neural network weights in memristive arrays to prevent IP theft with under 10% overhead

Model Theft

2 citations PDF

defense arXiv Sep 18, 2025 · Sep 2025

Watermarking and Anomaly Detection in Machine Learning Models for LORA RF Fingerprinting

Aarushi Mahajan, Wayne Burleson · University of Massachusetts Amherst

Defends RFFI ML models from copying and evasion via trigger watermarks and VAE anomaly detection on LoRa spectrograms

Model Theft Input Manipulation Attack audio

PDF

defense arXiv Sep 1, 2025 · Sep 2025

Throttling Web Agents Using Reasoning Gates

Abhinav Kumar, Jaechul Roh, Ali Naseh et al. · University of Massachusetts Amherst

Proposes reasoning-puzzle throttling gates to impose asymmetric compute costs on LLM web agents and prevent DoS-style overload

Excessive Agency nlp

PDF

attack arXiv Aug 27, 2025 · Aug 2025

Network-Level Prompt and Trait Leakage in Local Research Agents

Hyejun Jeong, Mohammadreza Teymoorianfard, Abhinav Kumar et al. · University of Massachusetts Amherst

Passive network observer recovers user prompts and traits from LLM research agents via DNS/IP timing side-channels

Sensitive Information Disclosure nlp

PDF Code
