ML Security Papers

Latest papers

10 papers

attack arXiv Apr 27, 2026 · 24d ago

Adaptive Prompt Embedding Optimization for LLM Jailbreaking

Miles Q. Li, Benjamin C. M. Fung, Boyang Li et al. · McGill University · Kean University +2 more

White-box jailbreak optimizing prompt embeddings directly instead of appending adversarial tokens, achieving higher success rates

Input Manipulation Attack Prompt Injection nlp

PDF

defense arXiv Mar 27, 2026 · 7w ago

Neighbor-Aware Localized Concept Erasure in Text-to-Image Diffusion Models

Zhuan Shi, Alireza Dehghanpour Farashah, Rik de Vries et al. · McGill University · Mila - Québec AI Institute +1 more

Training-free concept erasure for diffusion models that removes unwanted concepts while preserving semantically related neighboring concepts

Output Integrity Attack visiongenerative

PDF

benchmark arXiv Mar 19, 2026 · 9w ago

MIDST Challenge at SaTML 2025: Membership Inference over Diffusion-models-based Synthetic Tabular data

Masoumeh Shafieinejad, Xi He, Mahshid Alinoori et al. · Vector Institute · University of Waterloo +3 more

Competition evaluating membership inference attack resistance of diffusion models generating synthetic tabular data across white-box and black-box settings

Membership Inference Attack tabulargenerative

PDF Code

attack arXiv Mar 16, 2026 · 9w ago

Visual Confused Deputy: Exploiting and Defending Perception Failures in Computer-Using Agents

Xunzhuo Liu, Bowei He, Xue Liu et al. · vLLM Semantic Router Project · MBZUAI +3 more

Introduces visual confused deputy attacks on GUI agents via screenshot manipulation and proposes dual-channel guardrails verifying both visual targets and textual reasoning

Input Manipulation Attack Output Integrity Attack Excessive Agency visionmultimodalnlp

PDF Code

attack arXiv Mar 2, 2026 · 11w ago

VidDoS: Universal Denial-of-Service Attack on Video-based Large Language Models

Duoxun Tang, Dasen Dai, Jiyao Wang et al. · Tsinghua University · The Chinese University of Hong Kong +4 more

Universal sponge attack on Video-LLMs inflates token generation 205× and inference latency 15× via optimized adversarial video frame triggers

Input Manipulation Attack Model Denial of Service multimodalvisionnlp

PDF Code

defense arXiv Dec 2, 2025 · Dec 2025

Invasive Context Engineering to Control Large Language Models

Thomas Rivasseau · McGill University

Defends LLMs against long-context jailbreaks by inserting runtime control sentences into context, without retraining

Prompt Injection nlp

PDF

defense arXiv Nov 16, 2025 · Nov 2025

LLM Reinforcement in Context

Thomas Rivasseau · McGill University

Proposes inserting periodic alignment reminders into LLM context to defend against long-input jailbreaks and CoT scheming

Prompt Injection nlp

PDF

defense arXiv Oct 5, 2025 · Oct 2025

Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time

Daniel Tan, Anders Woodruff, Niels Warncke et al. · University College London · Center on Long-Term Risk +2 more

Proposes inoculation prompting, a training-time technique that suppresses backdoors and emergent misalignment in fine-tuned LLMs at test time

Model Poisoning Prompt Injection nlp

8 citations PDF

defense arXiv Oct 3, 2025 · Oct 2025

FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents

Imene Kerboua, Sahar Omidi Shayegan, Megh Thakkar et al. · LIRIS - CNRS · Esker +3 more

Defends LLM web agents against indirect prompt injection by pruning accessibility tree observations with a lightweight LLM retriever

Prompt Injection nlp

2 citations PDF

benchmark arXiv Sep 11, 2025 · Sep 2025

OpenFake: An Open Dataset and Platform Toward Real-World Deepfake Detection

Victor Livernoche, Akshatha Arodi, Andreea Musulan et al. · McGill University · Mila - Quebec Artificial Intelligence Institute +2 more

Introduces a 4M-image benchmark dataset and crowdsourced adversarial platform for detecting deepfakes from modern diffusion/transformer generators

Output Integrity Attack vision

PDF Code

Latest papers

Adaptive Prompt Embedding Optimization for LLM Jailbreaking

Neighbor-Aware Localized Concept Erasure in Text-to-Image Diffusion Models

MIDST Challenge at SaTML 2025: Membership Inference over Diffusion-models-based Synthetic Tabular data

Visual Confused Deputy: Exploiting and Defending Perception Failures in Computer-Using Agents

VidDoS: Universal Denial-of-Service Attack on Video-based Large Language Models

Invasive Context Engineering to Control Large Language Models

LLM Reinforcement in Context

Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time

FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents

OpenFake: An Open Dataset and Platform Toward Real-World Deepfake Detection

Filters

Time Period

Paper Type

OWASP ML Top 10

OWASP LLM Top 10

Institution

Venue