ML Security Papers

Latest papers

19 papers

defense ACL 2026 (Findings) Apr 19, 2026

Continual Safety Alignment via Gradient-Based Sample Selection

Thong Bach, Dung Nguyen, Thao Minh Le et al. · Deakin University · Pennsylvania State University

Gradient-based sample filtering during fine-tuning that preserves LLM safety alignment by removing high-gradient samples causing drift

Prompt Injection nlp

PDF

attack arXiv Apr 3, 2026

Poison Once, Exploit Forever: Environment-Injected Memory Poisoning Attacks on Web Agents

Wei Zou, Mingwen Dong, Miguel Romero Calvo et al. · Pennsylvania State University · Amazon Web Services

Memory poisoning attack on LLM web agents via contaminated webpage observations, achieving persistent cross-session compromise

Data Poisoning Attack Prompt Injection Excessive Agency nlp multimodal

PDF

benchmark arXiv Feb 28, 2026

A Comprehensive Evaluation of LLM Unlearning Robustness under Multi-Turn Interaction

Ruihao Pan, Suhang Wang · Pennsylvania State University

Shows LLM unlearning fails under multi-turn interaction; self-correction and dialogue history recover supposedly forgotten hazardous or private knowledge

Prompt Injection Sensitive Information Disclosure nlp

PDF

tool arXiv Feb 13, 2026

GPTZero: Robust Detection of LLM-Generated Texts

George Alexandru Adam, Alexander Cui, Edwin Thomas et al. · GPTZero · University of Waterloo +3 more

GPTZero detects LLM-generated text with a hierarchical multi-task architecture and adversarial robustness via red teaming

Output Integrity Attack nlp

PDF

attack arXiv Feb 8, 2026

Robustness of Vision Language Models Against Split-Image Harmful Input Attacks

Md Rafi Ur Rashid, MD Sadik Hossain Shanto, Vishnu Asutosh Dasu et al. · Pennsylvania State University · Bangladesh University of Engineering and Technology

Exploits VLM safety alignment gaps using split-image inputs to jailbreak modern VLMs with 60% better transfer than baselines

Input Manipulation Attack Prompt Injection vision multimodal nlp

PDF

benchmark arXiv Jan 27, 2026

Do Images Speak Louder than Words? Investigating the Effect of Textual Misinformation in VLMs

Chi Zhang, Wenxuan Ding, Jiale Liu et al. · The University of Texas at Austin · New York University +3 more

Benchmarks VLM susceptibility to persuasive conflicting text prompts that override visual evidence, finding 48% average accuracy drop

Prompt Injection vision nlp multimodal

PDF

attack arXiv Jan 21, 2026

Query-Efficient Agentic Graph Extraction Attacks on GraphRAG Systems

Shuhua Yang, Jiahao Zhang, Yilong Wang et al. · Pennsylvania State University

Black-box agentic attack that reconstructs up to 90% of a GraphRAG system's hidden knowledge graph via adaptive queries

Model Theft Sensitive Information Disclosure nlp graph

PDF

attack arXiv Jan 20, 2026

Uncovering and Understanding FPR Manipulation Attack in Industrial IoT Networks

Mohammad Shamim Ahsan, Peng Liu · Pennsylvania State University

Crafts benign MQTT packets that fool ML-based NIDSs into false positives, overwhelming SOC analysts with fabricated alerts

Input Manipulation Attack tabular

PDF

defense arXiv Dec 18, 2025

Prefix Probing: Lightweight Harmful Content Detection for Large Language Models

Jirui Yang, Hengqi Guo, Zhihui Lu et al. · Fudan University · Ant Group +1 more

Defends LLMs against harmful prompts by comparing refusal vs. agreement prefix log-probabilities with near-zero inference overhead

Prompt Injection nlp

PDF

attack arXiv Nov 23, 2025

TASO: Jailbreak LLMs via Alternative Template and Suffix Optimization

Yanting Wang, Runpeng Geng, Jinghui Chen et al. · Pennsylvania State University

Combines gradient-based suffix optimization with semantic template optimization to jailbreak LLMs more effectively than either alone

Input Manipulation Attack Prompt Injection nlp

PDF

defense arXiv Nov 22, 2025

Curvature-Aware Safety Restoration In LLMs Fine-Tuning

Thong Bach, Thanh Nguyen-Tang, Dung Nguyen et al. · Deakin University · New Jersey Institute of Technology +1 more

Restores LLM safety alignment after fine-tuning by exploiting shared loss-landscape geometry with curvature-aware second-order optimization

Transfer Learning Attack Prompt Injection nlp

1 citation PDF

attack arXiv Nov 20, 2025

When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models

Yuping Yan, Yuhan Xie, Yixin Zhang et al. · Westlake University · Pennsylvania State University +2 more

Multimodal adversarial attack framework targeting VLA robots via visual patches, gradient-based text, and cross-modal misalignment attacks

Input Manipulation Attack Prompt Injection vision nlp multimodal

1 citation PDF

defense arXiv Nov 17, 2025

InfoDecom: Decomposing Information for Defending Against Privacy Leakage in Split Inference

Ruijun Deng, Zhihui Lu, Qiang Duan · Fudan University · Pennsylvania State University

Defends split inference against data reconstruction attacks by decomposing redundant smashed-data information before injecting calibrated privacy noise

Model Inversion Attack vision

PDF Code

defense arXiv Nov 15, 2025

Rethinking Deep Alignment Through The Lens Of Incomplete Learning

Thong Bach, Dung Nguyen, Thao Minh Le et al. · Deakin University · Pennsylvania State University

Defends LLMs against jailbreaks by fixing gradient-decay-induced incomplete safety alignment via base-favored token penalties and teacher distillation

Input Manipulation Attack Prompt Injection nlp

PDF

defense arXiv Oct 15, 2025

PIShield: Detecting Prompt Injection Attacks via Intrinsic LLM Features

Wei Zou, Yupei Liu, Yanting Wang et al. · Pennsylvania State University · Duke University

Detects prompt injection in LLM applications using residual-stream representations and a lightweight linear classifier

Prompt Injection nlp

PDF

benchmark arXiv Sep 6, 2025

Benchmarking Robust Aggregation in Decentralized Gradient Marketplaces

Zeyu Song, Sainyam Galhotra, Shagufta Mehnaz · Pennsylvania State University · Cornell University

Benchmarks robust aggregation defenses against Byzantine and Sybil attacks in decentralized federated gradient marketplaces with new economic metrics

Data Poisoning Attack federated-learning

PDF

attack arXiv Aug 26, 2025

UniC-RAG: Universal Knowledge Corruption Attacks to Retrieval-Augmented Generation

Runpeng Geng, Yanting Wang, Ying Chen et al. · Pennsylvania State University

Injects 100 optimized adversarial documents into a RAG knowledge base to hijack LLM outputs for 2,000+ diverse queries

Input Manipulation Attack Prompt Injection nlp

PDF

tool arXiv Aug 15, 2025

SpecDetect: Simple, Fast, and Training-Free Detection of LLM-Generated Text via Spectral Analysis

Haitong Luo, Weiyao Zhang, Suhang Wang et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +3 more

Detects LLM-generated text via spectral energy of token log-probability sequences using DFT/STFT, outperforming SOTA at half the runtime

Output Integrity Attack nlp

PDF Code

defense arXiv Aug 5, 2025

AttnTrace: Attention-based Context Traceback for Long-Context LLMs

Yanting Wang, Runpeng Geng, Ying Chen et al. · Pennsylvania State University

Attention-weight traceback method that pinpoints injected instructions in long-context LLMs, improving prompt injection detection in RAG pipelines

Prompt Injection Blue-Team Agents nlp

PDF Code

