Latest papers

5 papers
defense arXiv Oct 6, 2025 · Oct 2025

Indirect Prompt Injections: Are Firewalls All You Need, or Stronger Benchmarks?

Rishika Bhagwatkar, Kevin Kasa, Abhay Puri et al. · ServiceNow Research · Mila - Québec AI Institute +3 more

Modular agent-tool firewall achieves perfect indirect prompt injection defense on four benchmarks, while exposing those benchmarks as too weak

Prompt Injection nlp
4 citations PDF
attack arXiv Sep 26, 2025 · Sep 2025

Active Attacks: Red-teaming LLMs via Adaptive Environments

Taeyoung Yun, Pierre-Luc St-Charles, Jinkyoo Park et al. · KAIST · Mila – Québec AI Institute +3 more

RL-based red-teaming algorithm generates diverse LLM jailbreak prompts via adaptive victim fine-tuning, achieving 440x better coverage than GFlowNets

Prompt Injection nlp
1 citations PDF Code
benchmark arXiv Sep 11, 2025 · Sep 2025

OpenFake: An Open Dataset and Platform Toward Real-World Deepfake Detection

Victor Livernoche, Akshatha Arodi, Andreea Musulan et al. · McGill University · Mila - Quebec Artificial Intelligence Institute +2 more

Introduces a 4M-image benchmark dataset and crowdsourced adversarial platform for detecting deepfakes from modern diffusion/transformer generators

Output Integrity Attack vision
PDF Code
defense arXiv Aug 17, 2025 · Aug 2025

Rethinking Safety in LLM Fine-tuning: An Optimization Perspective

Minseon Kim, Jin Myung Kwak, Lama Alssum et al. · Microsoft Research · KAIST +5 more

Defends LLM safety during fine-tuning via hyperparameter tuning and EMA momentum, cutting harmful responses from 16% to 5%

Transfer Learning Attack Prompt Injection nlp
PDF
benchmark arXiv Jan 3, 2025 · Jan 2025

AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs

Sanjoy Chowdhury, Sayan Nag, Subhrajyoti Dasgupta et al. · College Park · University of Toronto +3 more

Benchmarks 13 audio-visual LLMs on adversarial robustness, compositional reasoning, and modality dependency with 600K samples, plus a preference-optimization defense

Input Manipulation Attack audiomultimodalnlp
12 citations PDF