Latest papers

15 papers
attack arXiv Mar 16, 2026 · 21d ago

Amplification Effects in Test-Time Reinforcement Learning: Safety and Reasoning Vulnerabilities

Vanshaj Khattar, Md Rafi ur Rashid, Moumita Choudhury et al. · Virginia Tech · Penn State University +2 more

Jailbreak injection during test-time RL amplifies LLM harmful outputs and degrades reasoning performance simultaneously

Prompt Injection Training Data Poisoning nlp
PDF
defense arXiv Mar 3, 2026 · 4w ago

Understanding and Mitigating Dataset Corruption in LLM Steering

Cullen Anderson, Narmeen Oozeer, Foad Namjoo et al. · University of Massachusetts Amherst · Martian AI +2 more

Analyzes adversarial data poisoning of LLM contrastive steering datasets and defends with robust mean estimation

Data Poisoning Attack Training Data Poisoning nlp
PDF
attack arXiv Feb 9, 2026 · 8w ago

Stress-Testing Alignment Audits With Prompt-Level Strategic Deception

Oliver Daniels, Perusha Moodley, Ben Marlin et al. · MATS · University of Massachusetts Amherst +1 more

Automated red-team pipeline generates system prompts that fool both black-box and white-box LLM alignment auditing methods via strategic deception

Prompt Injection nlp
PDF Code
benchmark arXiv Jan 30, 2026 · 9w ago

Assessing Domain-Level Susceptibility to Emergent Misalignment from Narrow Finetuning

Abhishek Mishra, Mugilan Arulvanan, Reshma Ashok et al. · University of Massachusetts Amherst

Benchmarks domain-level LLM misalignment susceptibility from insecure fine-tuning and backdoor triggers, ranking 11 domains from 0% to 87.67% vulnerability

Transfer Learning Attack Model Poisoning nlp
PDF Code
defense arXiv Jan 24, 2026 · 10w ago

Improving User Privacy in Personalized Generation: Client-Side Retrieval-Augmented Modification of Server-Side Generated Speculations

Alireza Salemi, Hamed Zamani · University of Massachusetts Amherst

Privacy-preserving LLM personalization framework keeping user profiles client-side while resisting attribute inference and linkability attacks

Sensitive Information Disclosure nlp
PDF
attack arXiv Jan 14, 2026 · 11w ago

Identifying Models Behind Text-to-Image Leaderboards

Ali Naseh, Yuefeng Peng, Anshuman Suri et al. · University of Massachusetts Amherst · Northeastern University

Attacks T2I leaderboard anonymity by clustering model outputs in embedding space, deanonymizing 22 models from 150K images

Output Integrity Attack visiongenerative
PDF
benchmark arXiv Jan 12, 2026 · 12w ago

Small Symbols, Big Risks: Exploring Emoticon Semantic Confusion in Large Language Models

Weipeng Jiang, Xiaoyu Zhang, Juan Zhai et al. · Xi’an Jiaotong University · Nanyang Technological University +1 more

Discovers ASCII emoticons in prompts cause >38% semantic confusion in LLMs, producing syntactically valid but destructive silent failures in code generation

Prompt Injection nlp
PDF
defense arXiv Jan 9, 2026 · 12w ago

Memory Poisoning Attack and Defense on Memory Based LLM-Agents

Balachandra Devarangadi Sunil, Isheeta Sinha, Piyush Maheshwari et al. · University of Massachusetts Amherst

Evaluates memory poisoning attacks on EHR LLM agents and proposes trust-scored I/O moderation and memory sanitization defenses

Prompt Injection nlp
1 citations PDF Code
defense arXiv Jan 8, 2026 · 12w ago

Chain-of-Sanitized-Thoughts: Plugging PII Leakage in CoT of Large Reasoning Models

Arghyadeep Das, Sai Sreenivas Chintha, Rishiraj Girmal et al. · University of Massachusetts Amherst

Defends against PII leakage in LLM chain-of-thought reasoning via prompt engineering and privacy-aware fine-tuning

Sensitive Information Disclosure nlp
1 citations PDF
attack arXiv Oct 20, 2025 · Oct 2025

PLAGUE: Plug-and-play framework for Lifelong Adaptive Generation of Multi-turn Exploits

Neeladri Bhuiya, Madhav Aggarwal, Diptanshu Purwar · Inc. · University of Massachusetts Amherst

Multi-turn LLM jailbreak framework using lifelong-learning agents achieves 81.4% ASR on OpenAI o3 via structured Primer-Planner-Finisher attack phases

Prompt Injection nlp
PDF
attack arXiv Oct 7, 2025 · Oct 2025

Text-to-Image Models Leave Identifiable Signatures: Implications for Leaderboard Security

Ali Naseh, Anshuman Suri, Yuefeng Peng et al. · University of Massachusetts Amherst · Northeastern University

Deanonymizes text-to-image leaderboard models via CLIP embedding signatures, enabling rank manipulation attacks with near-perfect accuracy

Output Integrity Attack visiongenerative
PDF
defense IEEE International Conference ... Oct 1, 2025 · Oct 2025

Integrated Security Mechanisms for Weight Protection in Memristive Crossbar Arrays

Muhammad Faheemur Rahman, Wayne Burleson · University of Massachusetts Amherst

Hardware security mechanisms scramble and watermark neural network weights in memristive arrays to prevent IP theft with under 10% overhead

Model Theft
2 citations PDF
defense arXiv Sep 18, 2025 · Sep 2025

Watermarking and Anomaly Detection in Machine Learning Models for LORA RF Fingerprinting

Aarushi Mahajan, Wayne Burleson · University of Massachusetts Amherst

Defends RFFI ML models from copying and evasion via trigger watermarks and VAE anomaly detection on LoRa spectrograms

Model Theft Input Manipulation Attack audio
PDF
defense arXiv Sep 1, 2025 · Sep 2025

Throttling Web Agents Using Reasoning Gates

Abhinav Kumar, Jaechul Roh, Ali Naseh et al. · University of Massachusetts Amherst

Proposes reasoning-puzzle throttling gates to impose asymmetric compute costs on LLM web agents and prevent DoS-style overload

Excessive Agency nlp
PDF
attack arXiv Aug 27, 2025 · Aug 2025

Network-Level Prompt and Trait Leakage in Local Research Agents

Hyejun Jeong, Mohammadreza Teymoorianfard, Abhinav Kumar et al. · University of Massachusetts Amherst

Passive network observer recovers user prompts and traits from LLM research agents via DNS/IP timing side-channels

Sensitive Information Disclosure nlp
PDF Code