Latest papers

9 papers
defense arXiv Mar 27, 2026 · 10d ago

Neighbor-Aware Localized Concept Erasure in Text-to-Image Diffusion Models

Zhuan Shi, Alireza Dehghanpour Farashah, Rik de Vries et al. · McGill University · Mila - Québec AI Institute +1 more

Training-free concept erasure for diffusion models that removes unwanted concepts while preserving semantically related neighboring concepts

Output Integrity Attack visiongenerative
PDF
benchmark arXiv Mar 19, 2026 · 18d ago

MIDST Challenge at SaTML 2025: Membership Inference over Diffusion-models-based Synthetic Tabular data

Masoumeh Shafieinejad, Xi He, Mahshid Alinoori et al. · Vector Institute · University of Waterloo +3 more

Competition evaluating membership inference attack resistance of diffusion models generating synthetic tabular data across white-box and black-box settings

Membership Inference Attack tabulargenerative
PDF Code
attack arXiv Mar 16, 2026 · 21d ago

Visual Confused Deputy: Exploiting and Defending Perception Failures in Computer-Using Agents

Xunzhuo Liu, Bowei He, Xue Liu et al. · vLLM Semantic Router Project · MBZUAI +3 more

Introduces visual confused deputy attacks on GUI agents via screenshot manipulation and proposes dual-channel guardrails verifying both visual targets and textual reasoning

Input Manipulation Attack Output Integrity Attack Excessive Agency visionmultimodalnlp
PDF Code
attack arXiv Mar 2, 2026 · 5w ago

VidDoS: Universal Denial-of-Service Attack on Video-based Large Language Models

Duoxun Tang, Dasen Dai, Jiyao Wang et al. · Tsinghua University · The Chinese University of Hong Kong +4 more

Universal sponge attack on Video-LLMs inflates token generation 205× and inference latency 15× via optimized adversarial video frame triggers

Input Manipulation Attack Model Denial of Service multimodalvisionnlp
PDF Code
defense arXiv Dec 2, 2025 · Dec 2025

Invasive Context Engineering to Control Large Language Models

Thomas Rivasseau · McGill University

Defends LLMs against long-context jailbreaks by inserting runtime control sentences into context, without retraining

Prompt Injection nlp
PDF
defense arXiv Nov 16, 2025 · Nov 2025

LLM Reinforcement in Context

Thomas Rivasseau · McGill University

Proposes inserting periodic alignment reminders into LLM context to defend against long-input jailbreaks and CoT scheming

Prompt Injection nlp
PDF
defense arXiv Oct 5, 2025 · Oct 2025

Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time

Daniel Tan, Anders Woodruff, Niels Warncke et al. · University College London · Center on Long-Term Risk +2 more

Proposes inoculation prompting, a training-time technique that suppresses backdoors and emergent misalignment in fine-tuned LLMs at test time

Model Poisoning Prompt Injection nlp
8 citations PDF
defense arXiv Oct 3, 2025 · Oct 2025

FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents

Imene Kerboua, Sahar Omidi Shayegan, Megh Thakkar et al. · LIRIS - CNRS · Esker +3 more

Defends LLM web agents against indirect prompt injection by pruning accessibility tree observations with a lightweight LLM retriever

Prompt Injection nlp
2 citations PDF
benchmark arXiv Sep 11, 2025 · Sep 2025

OpenFake: An Open Dataset and Platform Toward Real-World Deepfake Detection

Victor Livernoche, Akshatha Arodi, Andreea Musulan et al. · McGill University · Mila - Quebec Artificial Intelligence Institute +2 more

Introduces a 4M-image benchmark dataset and crowdsourced adversarial platform for detecting deepfakes from modern diffusion/transformer generators

Output Integrity Attack vision
PDF Code