ML Security Papers

Latest papers

26 papers

attack arXiv Apr 27, 2026 · 24d ago

Adaptive Prompt Embedding Optimization for LLM Jailbreaking

Miles Q. Li, Benjamin C. M. Fung, Boyang Li et al. · McGill University · Kean University +2 more

White-box jailbreak optimizing prompt embeddings directly instead of appending adversarial tokens, achieving higher success rates

Input Manipulation Attack Prompt Injection nlp

PDF

benchmark arXiv Apr 25, 2026 · 26d ago

Evaluating Jailbreaking Vulnerabilities in LLMs Deployed as Assistants for Smart Grid Operations: A Benchmark Against NERC Standards

Taha Hammadia, Lucas Rea, Ahmad Mohammad Saber et al. · University of Toronto · Concordia University

Benchmarks three jailbreak attacks against LLMs used for smart grid compliance, finding 33% overall success rate with DeepInception most effective

Prompt Injection nlp

PDF

benchmark arXiv Mar 19, 2026 · 9w ago

MIDST Challenge at SaTML 2025: Membership Inference over Diffusion-models-based Synthetic Tabular data

Masoumeh Shafieinejad, Xi He, Mahshid Alinoori et al. · Vector Institute · University of Waterloo +3 more

Competition evaluating membership inference attack resistance of diffusion models generating synthetic tabular data across white-box and black-box settings

Membership Inference Attack tabular generative

PDF Code

benchmark arXiv Feb 6, 2026 · Feb 2026

TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering

Saad Hossain, Tom Tseng, Punya Syon Pandey et al. · Critical ML Lab · FAR.AI +6 more

Benchmark framework for evaluating LLM tamper resistance across 9 fine-tuning and weight-space attacks on 21 open-weight models

Transfer Learning Attack Prompt Injection nlp

1 citation PDF Code

defense arXiv Feb 3, 2026 · Feb 2026

WST-X Series: Wavelet Scattering Transform for Interpretable Speech Deepfake Detection

Xi Xuan, Davide Carbone, Ruchi Pandey et al. · University of Eastern Finland · Laboratoire de Physique de l'Ecole Normale Supérieure +2 more

Proposes wavelet scattering transform features for interpretable speech deepfake detection, outperforming SSL front-ends on a challenging benchmark

Output Integrity Attack audio

PDF

defense arXiv Jan 28, 2026 · Jan 2026

How does information access affect LLM monitors' ability to detect sabotage?

Rauno Arike, Raja Mehta Moreno, Rohan Subramani et al. · Aether Research · Vector Institute +4 more

Proposes extract-and-evaluate monitoring to catch sabotaging LLM agents, finding that giving monitors less information often yields better detection

Excessive Agency nlp

PDF

attack arXiv Jan 26, 2026 · Jan 2026

ARMOR: Agentic Reasoning for Methods Orchestration and Reparameterization for Robust Adversarial Attacks

Gabriel Lee Jun Rong, Christos Korgialas, Dion Jia Xu Ho et al. · Singapore Institute of Technology · Aristotle University of Thessaloniki +3 more

Agentic VLM/LLM system orchestrates CW, JSMA, and STA attacks to evade deepfake detectors with improved black-box transfer

Input Manipulation Attack vision multimodal nlp

PDF

benchmark arXiv Jan 19, 2026 · Jan 2026

Objective Matters: Fine-Tuning Objectives Shape Safety, Robustness, and Persona Drift

Daniel Vennemeyer, Punya Syon Pandey, Phan Anh Duong et al. · University of Cincinnati · University of Toronto +1 more

Compares six LLM fine-tuning objectives and finds ORPO and KL-regularization best preserve jailbreak resistance and alignment at scale

Transfer Learning Attack Prompt Injection nlp

PDF

defense arXiv Jan 14, 2026 · Jan 2026

CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents

Hanna Foerster, Tom Blanchard, Kristina Nikolić et al. · University of Cambridge · University of Toronto +3 more

Defends computer-use AI agents against prompt injection via pre-computed execution graphs, revealing Branch Steering as a residual threat

Prompt Injection Excessive Agency nlp multimodal

1 citation PDF

survey arXiv Jan 14, 2026 · Jan 2026

The Promptware Kill Chain: How Prompt Injections Gradually Evolved Into a Multistep Malware Delivery Mechanism

Oleg Brodt, Elad Feldman, Bruce Schneier et al. · Ben-Gurion University of the Negev · Tel Aviv University +2 more

Surveys 36 LLM attack incidents and proposes a seven-stage promptware kill chain mapping prompt injection to multi-step malware delivery

Prompt Injection Excessive Agency nlp

PDF

attack arXiv Dec 9, 2025 · Dec 2025

HarmTransform: Transforming Explicit Harmful Queries into Stealthy via Multi-Agent Debate

Shenzhe Zhu · University of Toronto

Multi-agent debate framework that transforms explicit harmful queries into stealthy variants that evade LLM safety mechanisms

Prompt Injection nlp

PDF

benchmark arXiv Nov 28, 2025 · Nov 2025

Are LLMs Good Safety Agents or a Propaganda Engine?

Neemesh Yadav, Francesco Ortu, Jiarui Liu et al. · Southern Methodist University · University of Trieste +6 more

Benchmarks LLM refusal behaviors using prompt injection attacks to distinguish genuine safety guardrails from political censorship

Prompt Injection nlp

PDF

defense arXiv Nov 26, 2025 · Nov 2025

HMARK: Radioactive Multi-Bit Semantic-Latent Watermarking for Diffusion Models

Kexin Li, Guozhen Ding, Ilya Grishchenko et al. · University of Toronto

Radioactive multi-bit image watermark embedded in diffusion h-space that survives LoRA fine-tuning to detect unauthorized training data use

Output Integrity Attack vision generative

PDF

attack arXiv Nov 26, 2025 · Nov 2025

HarmonicAttack: An Adaptive Cross-Domain Audio Watermark Removal

Kexin Li, Xiao Hu, Ilya Grishchenko et al. · University of Toronto

Attacks AI-generated audio watermarks via dual-path convolutional autoencoder, defeating AudioSeal, WavMark, and Silentcipher at near real-time speed

Output Integrity Attack audio generative

PDF

attack arXiv Nov 24, 2025 · Nov 2025

Targeted Manipulation: Slope-Based Attacks on Financial Time-Series Data

Dominik Luszczynski · University of Toronto

Novel slope-based adversarial attacks on N-HiTS stock forecasting that double predicted trend slopes and evade CNN discriminator defenses

Input Manipulation Attack timeseries generative

PDF

benchmark arXiv Oct 14, 2025 · Oct 2025

An Investigation of Memorization Risk in Healthcare Foundation Models

Sana Tonekaboni, Lena Stempfle, Adibvafa Fallahpour et al. · MIT · Broad Institute +6 more

Black-box evaluation framework measuring extractable patient data memorization in healthcare EHR foundation models at embedding and generative levels

Model Inversion Attack tabular

1 citation PDF Code

defense arXiv Oct 8, 2025 · Oct 2025

Black-box Detection of LLM-generated Text Using Generalized Jensen-Shannon Divergence

Shuangyi Chen, Ashish Khisti · University of Toronto

Proposes SurpMark, a black-box AI-text detector using token surprisal state-transition matrices and generalized Jensen-Shannon divergence

Output Integrity Attack nlp

PDF

defense arXiv Oct 6, 2025 · Oct 2025

WaveSP-Net: Learnable Wavelet-Domain Sparse Prompt Tuning for Speech Deepfake Detection

Xi Xuan, Xuechen Liu, Wenxin Zhang et al. · University of Eastern Finland · National Institute of Informatics +4 more

Novel wavelet prompt-tuning architecture for speech deepfake detection, outperforming SOTA on two benchmarks with far fewer trainable parameters

Output Integrity Attack audio

1 citation PDF Code

benchmark arXiv Oct 6, 2025 · Oct 2025

SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests

Punya Syon Pandey, Hai Son Le, Devansh Bhardwaj et al. · University of Toronto · Vector Institute +4 more

Benchmarks LLM vulnerability to sociopolitical harm requests across 585 prompts, 34 countries, revealing 97–98% attack success rates

Prompt Injection nlp

PDF Code

benchmark arXiv Sep 10, 2025 · Sep 2025

Evaluation Awareness Scales Predictably in Open-Weights Large Language Models

Maheep Chaudhary, Ian Su, Nikhil Hooda et al. · Independent · University of California +6 more

Discovers power-law scaling of LLM evaluation awareness across 15 models, forecasting deceptive capability concealment in larger models

Prompt Injection nlp

PDF Code
