Latest papers

24 papers
benchmark arXiv Mar 19, 2026 · 18d ago

MIDST Challenge at SaTML 2025: Membership Inference over Diffusion-models-based Synthetic Tabular data

Masoumeh Shafieinejad, Xi He, Mahshid Alinoori et al. · Vector Institute · University of Waterloo +3 more

Competition evaluating membership inference attack resistance of diffusion models generating synthetic tabular data across white-box and black-box settings

Membership Inference Attack tabular generative
PDF Code
benchmark arXiv Feb 6, 2026 · 8w ago

TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering

Saad Hossain, Tom Tseng, Punya Syon Pandey et al. · Critical ML Lab · FAR.AI +6 more

Benchmark framework for evaluating LLM tamper resistance across 9 fine-tuning and weight-space attacks on 21 open-weight models

Transfer Learning Attack Prompt Injection nlp
1 citation PDF Code
defense arXiv Feb 3, 2026 · 8w ago

WST-X Series: Wavelet Scattering Transform for Interpretable Speech Deepfake Detection

Xi Xuan, Davide Carbone, Ruchi Pandey et al. · University of Eastern Finland · Laboratoire de Physique de l'Ecole Normale Supérieure +2 more

Proposes wavelet scattering transform features for interpretable speech deepfake detection, outperforming SSL front-ends on a challenging benchmark

Output Integrity Attack audio
PDF
defense arXiv Jan 28, 2026 · 9w ago

How does information access affect LLM monitors' ability to detect sabotage?

Rauno Arike, Raja Mehta Moreno, Rohan Subramani et al. · Aether Research · Vector Institute +4 more

Proposes extract-and-evaluate monitoring to catch sabotaging LLM agents, finding that giving the monitor less information often yields better detection

Excessive Agency nlp
PDF
attack arXiv Jan 26, 2026 · 10w ago

ARMOR: Agentic Reasoning for Methods Orchestration and Reparameterization for Robust Adversarial Attacks

Gabriel Lee Jun Rong, Christos Korgialas, Dion Jia Xu Ho et al. · Singapore Institute of Technology · Aristotle University of Thessaloniki +3 more

Agentic VLM/LLM system orchestrates CW, JSMA, and STA attacks to evade deepfake detectors with improved black-box transfer

Input Manipulation Attack vision multimodal nlp
PDF
benchmark arXiv Jan 19, 2026 · 11w ago

Objective Matters: Fine-Tuning Objectives Shape Safety, Robustness, and Persona Drift

Daniel Vennemeyer, Punya Syon Pandey, Phan Anh Duong et al. · University of Cincinnati · University of Toronto +1 more

Compares six LLM fine-tuning objectives and finds ORPO and KL-regularization best preserve jailbreak resistance and alignment at scale

Transfer Learning Attack Prompt Injection nlp
PDF
defense arXiv Jan 14, 2026 · 11w ago

CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents

Hanna Foerster, Tom Blanchard, Kristina Nikolić et al. · University of Cambridge · University of Toronto +3 more

Defends computer-use AI agents against prompt injection via pre-computed execution graphs, revealing Branch Steering as a residual threat

Prompt Injection Excessive Agency nlp multimodal
1 citation PDF
survey arXiv Jan 14, 2026 · 11w ago

The Promptware Kill Chain: How Prompt Injections Gradually Evolved Into a Multistep Malware Delivery Mechanism

Oleg Brodt, Elad Feldman, Bruce Schneier et al. · Ben-Gurion University of the Negev · Tel Aviv University +2 more

Surveys 36 LLM attack incidents and proposes a seven-stage promptware kill chain mapping prompt injection to multi-step malware delivery

Prompt Injection Excessive Agency nlp
PDF
attack arXiv Dec 9, 2025 · Dec 2025

HarmTransform: Transforming Explicit Harmful Queries into Stealthy Variants via Multi-Agent Debate

Shenzhe Zhu · University of Toronto

Multi-agent debate framework that transforms explicit harmful queries into stealthy variants that evade LLM safety mechanisms

Prompt Injection nlp
PDF
benchmark arXiv Nov 28, 2025 · Nov 2025

Are LLMs Good Safety Agents or a Propaganda Engine?

Neemesh Yadav, Francesco Ortu, Jiarui Liu et al. · Southern Methodist University · University of Trieste +6 more

Benchmarks LLM refusal behaviors using prompt injection attacks to distinguish genuine safety guardrails from political censorship

Prompt Injection nlp
PDF
attack arXiv Nov 26, 2025 · Nov 2025

HarmonicAttack: An Adaptive Cross-Domain Audio Watermark Removal

Kexin Li, Xiao Hu, Ilya Grishchenko et al. · University of Toronto

Attacks AI-generated audio watermarks via dual-path convolutional autoencoder, defeating AudioSeal, WavMark, and Silentcipher at near real-time speed

Output Integrity Attack audio generative
PDF
defense arXiv Nov 26, 2025 · Nov 2025

HMARK: Radioactive Multi-Bit Semantic-Latent Watermarking for Diffusion Models

Kexin Li, Guozhen Ding, Ilya Grishchenko et al. · University of Toronto

Radioactive multi-bit image watermark embedded in diffusion h-space that survives LoRA fine-tuning to detect unauthorized training data use

Output Integrity Attack vision generative
PDF
attack arXiv Nov 24, 2025 · Nov 2025

Targeted Manipulation: Slope-Based Attacks on Financial Time-Series Data

Dominik Luszczynski · University of Toronto

Novel slope-based adversarial attacks on N-HiTS stock forecasting that double predicted trend slopes and evade CNN discriminator defenses

Input Manipulation Attack timeseries generative
PDF
benchmark arXiv Oct 14, 2025 · Oct 2025

An Investigation of Memorization Risk in Healthcare Foundation Models

Sana Tonekaboni, Lena Stempfle, Adibvafa Fallahpour et al. · MIT · Broad Institute +6 more

Black-box evaluation framework measuring extractable patient data memorization in healthcare EHR foundation models at embedding and generative levels

Model Inversion Attack tabular
1 citation PDF Code
defense arXiv Oct 8, 2025 · Oct 2025

Black-box Detection of LLM-generated Text Using Generalized Jensen-Shannon Divergence

Shuangyi Chen, Ashish Khisti · University of Toronto

Proposes SurpMark, a black-box AI-text detector using token surprisal state-transition matrices and generalized Jensen-Shannon divergence

Output Integrity Attack nlp
PDF
defense arXiv Oct 6, 2025 · Oct 2025

WaveSP-Net: Learnable Wavelet-Domain Sparse Prompt Tuning for Speech Deepfake Detection

Xi Xuan, Xuechen Liu, Wenxin Zhang et al. · University of Eastern Finland · National Institute of Informatics +4 more

Novel wavelet prompt-tuning architecture for speech deepfake detection, outperforming SOTA on two benchmarks with far fewer trainable parameters

Output Integrity Attack audio
1 citation PDF Code
benchmark arXiv Oct 6, 2025 · Oct 2025

SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests

Punya Syon Pandey, Hai Son Le, Devansh Bhardwaj et al. · University of Toronto · Vector Institute +4 more

Benchmarks LLM vulnerability to sociopolitical harm requests across 585 prompts spanning 34 countries, revealing 97–98% attack success rates

Prompt Injection nlp
PDF Code
benchmark arXiv Sep 10, 2025 · Sep 2025

Evaluation Awareness Scales Predictably in Open-Weights Large Language Models

Maheep Chaudhary, Ian Su, Nikhil Hooda et al. · Independent · University of California +6 more

Discovers power-law scaling of LLM evaluation awareness across 15 models, forecasting deceptive capability concealment in larger models

Prompt Injection nlp
PDF Code
tool arXiv Sep 2, 2025 · Sep 2025

Scam2Prompt: A Scalable Framework for Auditing Malicious Scam Endpoints in Production LLMs

Zhiyang Chen, Tara Saba, Xun Deng et al. · University of Toronto

Auditing framework exposes production LLMs reproducing memorized scam URLs via innocuous prompts, with guardrails detecting under 0.3% of cases

Data Poisoning Attack Training Data Poisoning Sensitive Information Disclosure nlp
PDF
attack arXiv Aug 21, 2025 · Aug 2025

Adversarial Attacks against Neural Ranking Models via In-Context Learning

Amin Bigdeli, Negar Arabzadeh, Ebrahim Bagheri et al. · University of Waterloo · University of California +1 more

Uses LLM few-shot prompting to generate fluent adversarial documents that fool neural ranking models into elevating health misinformation

Input Manipulation Attack nlp
PDF