Latest papers

26 papers
attack arXiv Apr 27, 2026 · 24d ago

Adaptive Prompt Embedding Optimization for LLM Jailbreaking

Miles Q. Li, Benjamin C. M. Fung, Boyang Li et al. · McGill University · Kean University +2 more

White-box jailbreak optimizing prompt embeddings directly instead of appending adversarial tokens, achieving higher success rates

Input Manipulation Attack Prompt Injection nlp
PDF
benchmark arXiv Apr 25, 2026 · 26d ago

Evaluating Jailbreaking Vulnerabilities in LLMs Deployed as Assistants for Smart Grid Operations: A Benchmark Against NERC Standards

Taha Hammadia, Lucas Rea, Ahmad Mohammad Saber et al. · University of Toronto · Concordia University

Benchmarks three jailbreak attacks against LLMs used for smart grid compliance, finding a 33% overall success rate with DeepInception most effective

Prompt Injection nlp
PDF
benchmark arXiv Mar 19, 2026 · 9w ago

MIDST Challenge at SaTML 2025: Membership Inference over Diffusion-models-based Synthetic Tabular data

Masoumeh Shafieinejad, Xi He, Mahshid Alinoori et al. · Vector Institute · University of Waterloo +3 more

Competition evaluating membership inference attack resistance of diffusion models generating synthetic tabular data across white-box and black-box settings

Membership Inference Attack tabular generative
PDF Code
benchmark arXiv Feb 6, 2026 · Feb 2026

TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering

Saad Hossain, Tom Tseng, Punya Syon Pandey et al. · Critical ML Lab · FAR.AI +6 more

Benchmark framework for evaluating LLM tamper resistance across 9 fine-tuning and weight-space attacks on 21 open-weight models

Transfer Learning Attack Prompt Injection nlp
1 citation PDF Code
defense arXiv Feb 3, 2026 · Feb 2026

WST-X Series: Wavelet Scattering Transform for Interpretable Speech Deepfake Detection

Xi Xuan, Davide Carbone, Ruchi Pandey et al. · University of Eastern Finland · Laboratoire de Physique de l'Ecole Normale Supérieure +2 more

Proposes wavelet scattering transform features for interpretable speech deepfake detection, outperforming SSL front-ends on a challenging benchmark

Output Integrity Attack audio
PDF
defense arXiv Jan 28, 2026 · Jan 2026

How does information access affect LLM monitors' ability to detect sabotage?

Rauno Arike, Raja Mehta Moreno, Rohan Subramani et al. · Aether Research · Vector Institute +4 more

Proposes extract-and-evaluate monitoring to catch sabotaging LLM agents, finding that less monitor information often yields better detection

Excessive Agency nlp
PDF
attack arXiv Jan 26, 2026 · Jan 2026

ARMOR: Agentic Reasoning for Methods Orchestration and Reparameterization for Robust Adversarial Attacks

Gabriel Lee Jun Rong, Christos Korgialas, Dion Jia Xu Ho et al. · Singapore Institute of Technology · Aristotle University of Thessaloniki +3 more

Agentic VLM/LLM system orchestrates CW, JSMA, and STA attacks to evade deepfake detectors with improved black-box transfer

Input Manipulation Attack vision multimodal nlp
PDF
benchmark arXiv Jan 19, 2026 · Jan 2026

Objective Matters: Fine-Tuning Objectives Shape Safety, Robustness, and Persona Drift

Daniel Vennemeyer, Punya Syon Pandey, Phan Anh Duong et al. · University of Cincinnati · University of Toronto +1 more

Compares six LLM fine-tuning objectives and finds ORPO and KL-regularization best preserve jailbreak resistance and alignment at scale

Transfer Learning Attack Prompt Injection nlp
PDF
defense arXiv Jan 14, 2026 · Jan 2026

CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents

Hanna Foerster, Tom Blanchard, Kristina Nikolić et al. · University of Cambridge · University of Toronto +3 more

Defends computer-use AI agents against prompt injection via pre-computed execution graphs, revealing Branch Steering as a residual threat

Prompt Injection Excessive Agency nlp multimodal
1 citation PDF
survey arXiv Jan 14, 2026 · Jan 2026

The Promptware Kill Chain: How Prompt Injections Gradually Evolved Into a Multistep Malware Delivery Mechanism

Oleg Brodt, Elad Feldman, Bruce Schneier et al. · Ben-Gurion University of the Negev · Tel Aviv University +2 more

Surveys 36 LLM attack incidents and proposes a seven-stage promptware kill chain mapping prompt injection to multi-step malware delivery

Prompt Injection Excessive Agency nlp
PDF
attack arXiv Dec 9, 2025 · Dec 2025

HarmTransform: Transforming Explicit Harmful Queries into Stealthy via Multi-Agent Debate

Shenzhe Zhu · University of Toronto

Multi-agent debate framework that transforms explicit harmful queries into stealthy variants that evade LLM safety mechanisms

Prompt Injection nlp
PDF
benchmark arXiv Nov 28, 2025 · Nov 2025

Are LLMs Good Safety Agents or a Propaganda Engine?

Neemesh Yadav, Francesco Ortu, Jiarui Liu et al. · Southern Methodist University · University of Trieste +6 more

Benchmarks LLM refusal behaviors using prompt injection attacks to distinguish genuine safety guardrails from political censorship

Prompt Injection nlp
PDF
defense arXiv Nov 26, 2025 · Nov 2025

HMARK: Radioactive Multi-Bit Semantic-Latent Watermarking for Diffusion Models

Kexin Li, Guozhen Ding, Ilya Grishchenko et al. · University of Toronto

Radioactive multi-bit image watermark embedded in diffusion h-space that survives LoRA fine-tuning to detect unauthorized training data use

Output Integrity Attack vision generative
PDF
attack arXiv Nov 26, 2025 · Nov 2025

HarmonicAttack: An Adaptive Cross-Domain Audio Watermark Removal

Kexin Li, Xiao Hu, Ilya Grishchenko et al. · University of Toronto

Attacks AI-generated audio watermarks via a dual-path convolutional autoencoder, defeating AudioSeal, WavMark, and Silentcipher at near real-time speed

Output Integrity Attack audio generative
PDF
attack arXiv Nov 24, 2025 · Nov 2025

Targeted Manipulation: Slope-Based Attacks on Financial Time-Series Data

Dominik Luszczynski · University of Toronto

Novel slope-based adversarial attacks on N-HiTS stock forecasting that double predicted trend slopes and evade CNN discriminator defenses

Input Manipulation Attack timeseries generative
PDF
benchmark arXiv Oct 14, 2025 · Oct 2025

An Investigation of Memorization Risk in Healthcare Foundation Models

Sana Tonekaboni, Lena Stempfle, Adibvafa Fallahpour et al. · MIT · Broad Institute +6 more

Black-box evaluation framework measuring extractable patient data memorization in healthcare EHR foundation models at embedding and generative levels

Model Inversion Attack tabular
1 citation PDF Code
defense arXiv Oct 8, 2025 · Oct 2025

Black-box Detection of LLM-generated Text Using Generalized Jensen-Shannon Divergence

Shuangyi Chen, Ashish Khisti · University of Toronto

Proposes SurpMark, a black-box AI-text detector using token surprisal state-transition matrices and generalized Jensen-Shannon divergence

Output Integrity Attack nlp
PDF
defense arXiv Oct 6, 2025 · Oct 2025

WaveSP-Net: Learnable Wavelet-Domain Sparse Prompt Tuning for Speech Deepfake Detection

Xi Xuan, Xuechen Liu, Wenxin Zhang et al. · University of Eastern Finland · National Institute of Informatics +4 more

Novel wavelet prompt-tuning architecture for speech deepfake detection, outperforming SOTA on two benchmarks with far fewer trainable parameters

Output Integrity Attack audio
1 citation PDF Code
benchmark arXiv Oct 6, 2025 · Oct 2025

SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests

Punya Syon Pandey, Hai Son Le, Devansh Bhardwaj et al. · University of Toronto · Vector Institute +4 more

Benchmarks LLM vulnerability to sociopolitical harm requests across 585 prompts and 34 countries, revealing 97–98% attack success rates

Prompt Injection nlp
PDF Code
benchmark arXiv Sep 10, 2025 · Sep 2025

Evaluation Awareness Scales Predictably in Open-Weights Large Language Models

Maheep Chaudhary, Ian Su, Nikhil Hooda et al. · Independent · University of California +6 more

Discovers power-law scaling of LLM evaluation awareness across 15 models, forecasting deceptive capability concealment in larger models

Prompt Injection nlp
PDF Code