Latest papers

39 papers
attack arXiv Mar 17, 2026 · 20d ago

REFORGE: Multi-modal Attacks Reveal Vulnerable Concept Unlearning in Image Generation Models

Yong Zou, Haoran Li, Fanxiao Li et al. · Yunnan University · Northeastern University +1 more

Black-box adversarial image prompt attack that bypasses concept unlearning in diffusion models, recovering erased copyrighted and harmful concepts

Input Manipulation Attack vision multimodal generative
PDF Code
benchmark arXiv Mar 1, 2026 · 5w ago

I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift

Subramanyam Sahoo, Vinija Jain, Divya Chaudhary et al. · Independent · Meta AI +3 more

Exposes catastrophic silent failure of LLM toxicity safety classifiers under tiny embedding drift, defeating standard confidence-based monitoring

Prompt Injection nlp
PDF
attack arXiv Feb 28, 2026 · 5w ago

Roots Beneath the Cut: Uncovering the Risk of Concept Revival in Pruning-Based Unlearning for Diffusion Models

Ci Zhang, Zhaojun Ding, Chence Yang et al. · University of Georgia · Carnegie Mellon University +3 more

Attacks pruning-based unlearning in diffusion models by reviving erased concepts via side-channel signals from zeroed weight locations

Output Integrity Attack generative vision
PDF
benchmark arXiv Feb 23, 2026 · 6w ago

Agents of Chaos

Natalie Shapira, Chris Wendler, Avery Yen et al. · Northeastern University · Independent Researcher +11 more

Red-teams live autonomous LLM agents over two weeks, documenting 11 case studies of dangerous failures including system takeover, DoS, and sensitive data disclosure

Excessive Agency Prompt Injection Insecure Plugin Design nlp
3 citations PDF
survey arXiv Feb 23, 2026 · 6w ago

Agentic AI as a Cybersecurity Attack Surface: Threats, Exploits, and Defenses in Runtime Supply Chains

Xiaochong Jiang, Shiqi Yang, Wenting Yang et al. · Northeastern University · New York University +2 more

Surveys runtime attack surfaces of agentic LLM systems, introducing the Viral Agent Loop self-propagating worm and a Zero-Trust defense architecture

Prompt Injection Insecure Plugin Design Excessive Agency nlp
PDF
attack arXiv Feb 15, 2026 · 7w ago

SkillJect: Automating Stealthy Skill-Based Prompt Injection for Coding Agents with Trace-Driven Closed-Loop Refinement

Xiaojun Jia, Jie Liao, Simeng Qin et al. · Nanyang Technological University · Chongqing University +4 more

Automated framework that crafts stealthy skill-based prompt injections against LLM coding agents via trace-driven, closed-loop refinement

Prompt Injection Insecure Plugin Design nlp
PDF
defense arXiv Feb 13, 2026 · 7w ago

TCRL: Temporal-Coupled Adversarial Training for Robust Constrained Reinforcement Learning in Worst-Case Scenarios

Wentao Xu, Zhongming Yao, Weihao Li et al. · Northeastern University · Zhejiang University +1 more

Defends constrained RL agents against temporally coupled adversarial observation attacks via novel cost constraints and a dual-reward defense

Input Manipulation Attack reinforcement-learning
PDF
benchmark arXiv Feb 13, 2026 · 7w ago

Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents

Xu Li, Simon Yu, Minzhou Pan et al. · Northeastern University · Virtue AI +2 more

Benchmarks multi-turn jailbreaks in tool-using LLM agents and proposes ToolShield, a self-exploration defense that reduces attack success rate (ASR) by 30%

Prompt Injection Insecure Plugin Design nlp
PDF Code
tool arXiv Feb 9, 2026 · 8w ago

MUZZLE: Adaptive Agentic Red-Teaming of Web Agents Against Indirect Prompt Injection Attacks

Georgios Syros, Evan Rose, Brian Grinstead et al. · Northeastern University · Mozilla Corporation

Automated red-teaming framework that adaptively discovers indirect prompt injection attacks against LLM web agents via trajectory analysis

Prompt Injection Excessive Agency nlp
PDF
defense arXiv Feb 3, 2026 · 8w ago

PWAVEP: Purifying Imperceptible Adversarial Perturbations in 3D Point Clouds via Spectral Graph Wavelets

Haoran Li, Renyang Liu, Hongjia Liu et al. · Northeastern University · National University of Singapore +1 more

Spectral graph wavelet purification defense removes imperceptible adversarial perturbations from 3D point clouds without model modification

Input Manipulation Attack vision
PDF Code
attack arXiv Jan 30, 2026 · 9w ago

Semantics-Preserving Evasion of LLM Vulnerability Detectors

Luze Sun, Alina Oprea, Eric Wong · Northeastern University · University of Pennsylvania

Carrier-constrained GCG attacks evade LLM-based code vulnerability detectors using behavior-preserving code transformations that transfer to black-box APIs

Input Manipulation Attack nlp
PDF Code
attack arXiv Jan 27, 2026 · 9w ago

Thought-Transfer: Indirect Targeted Poisoning Attacks on Chain-of-Thought Reasoning Models

Harsh Chaudhari, Ethan Rathbun, Hanna Foerster et al. · Northeastern University · University of Cambridge +4 more

Poisons LLM CoT training data by corrupting reasoning traces to inject targeted behaviors into unseen domains without altering queries or answers

Data Poisoning Attack Training Data Poisoning nlp
PDF
defense arXiv Jan 15, 2026 · 11w ago

Defending Large Language Models Against Jailbreak Attacks via In-Decoding Safety-Awareness Probing

Yinzhi Zhao, Ming Wang, Shi Feng et al. · Northeastern University

Defends LLMs against jailbreaks by probing latent safety signals during decoding to detect and block harmful outputs early

Prompt Injection nlp
PDF Code
attack arXiv Jan 14, 2026 · 11w ago

Identifying Models Behind Text-to-Image Leaderboards

Ali Naseh, Yuefeng Peng, Anshuman Suri et al. · University of Massachusetts Amherst · Northeastern University

Attacks T2I leaderboard anonymity by clustering model outputs in embedding space, deanonymizing 22 models from 150K images

Output Integrity Attack vision generative
PDF
defense arXiv Jan 13, 2026 · 11w ago

Q-realign: Piggybacking Realignment on Quantization for Safe and Efficient LLM Deployment

Qitao Tan, Xiaoying Song, Ningxi Cheng et al. · University of Georgia · University of North Texas +2 more

Recovers LLM safety alignment eroded by fine-tuning, using post-training quantization instead of retraining, in 40 minutes on one GPU

Transfer Learning Attack Prompt Injection nlp
PDF Code
defense arXiv Dec 26, 2025 · Dec 2025

Scaling Adversarial Training via Data Selection

Youran Ye, Dejin Wang, Ajinkya Bhandare · Northeastern University

Efficient adversarial training selects critical samples via margin or gradient criteria, matching PGD robustness at half the compute cost

Input Manipulation Attack vision
PDF Code
defense arXiv Dec 23, 2025 · Dec 2025

Cost-TrustFL: Cost-Aware Hierarchical Federated Learning with Lightweight Reputation Evaluation across Multi-Cloud

Jixiao Yang, Jinyu Chen, Zixiao Huang et al. · Westcliff University · University of Washington +3 more

Defends federated learning against Byzantine poisoning attacks using Shapley-based reputation scores while minimizing multi-cloud communication costs

Data Poisoning Attack federated-learning vision
PDF
defense arXiv Dec 19, 2025 · Dec 2025

PermuteV: A Performant Side-channel-Resistant RISC-V Core Securing Edge AI Inference

Nuntipat Narkthong, Xiaolin Xu · Northeastern University

Hardware RISC-V core defense using random loop-order permutation to prevent EM side-channel extraction of NN weights on edge devices

Model Theft
PDF
attack arXiv Dec 10, 2025 · Dec 2025

Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs

Jan Betley, Jorio Cocola, Dylan Feng et al. · Truthful AI · MATS Fellowship +3 more

Demonstrates inductive backdoors and persona-poisoning attacks that corrupt LLMs through generalization from narrow fine-tuning

Model Poisoning Data Poisoning Attack Training Data Poisoning nlp
10 citations PDF
defense IACR ePrint Dec 9, 2025 · Dec 2025

Improved Pseudorandom Codes from Permuted Puzzles

Miranda Christ, Noah Golowich, Sam Gunn et al. · Columbia University · Microsoft Research +5 more

Constructs provably robust LLM watermarks with subexponential security, surviving worst-case edits and detection-key-aware adversaries

Output Integrity Attack nlp
PDF