ML Security Papers

Latest papers

5 papers

defense · arXiv · Oct 6, 2025

Indirect Prompt Injections: Are Firewalls All You Need, or Stronger Benchmarks?

Rishika Bhagwatkar, Kevin Kasa, Abhay Puri et al. · ServiceNow Research · Mila - Québec AI Institute +3 more

Modular agent-tool firewall achieves perfect indirect prompt injection defense on four benchmarks, while exposing those benchmarks as too weak

Prompt Injection nlp

4 citations PDF

attack · arXiv · Sep 26, 2025

Active Attacks: Red-teaming LLMs via Adaptive Environments

Taeyoung Yun, Pierre-Luc St-Charles, Jinkyoo Park et al. · KAIST · Mila – Québec AI Institute +3 more

RL-based red-teaming algorithm generates diverse LLM jailbreak prompts via adaptive victim fine-tuning, achieving 440x better coverage than GFlowNets

Prompt Injection nlp

1 citation PDF Code

benchmark · arXiv · Sep 11, 2025

OpenFake: An Open Dataset and Platform Toward Real-World Deepfake Detection

Victor Livernoche, Akshatha Arodi, Andreea Musulan et al. · McGill University · Mila - Quebec Artificial Intelligence Institute +2 more

Introduces a 4M-image benchmark dataset and crowdsourced adversarial platform for detecting deepfakes from modern diffusion/transformer generators

Output Integrity Attack vision

PDF Code

defense · arXiv · Aug 17, 2025

Rethinking Safety in LLM Fine-tuning: An Optimization Perspective

Minseon Kim, Jin Myung Kwak, Lama Alssum et al. · Microsoft Research · KAIST +5 more

Defends LLM safety during fine-tuning via hyperparameter tuning and EMA momentum, cutting harmful responses from 16% to 5%

Transfer Learning Attack Prompt Injection nlp

PDF

benchmark · arXiv · Jan 3, 2025

AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs

Sanjoy Chowdhury, Sayan Nag, Subhrajyoti Dasgupta et al. · College Park · University of Toronto +3 more

Benchmarks 13 audio-visual LLMs on adversarial robustness, compositional reasoning, and modality dependency with 600K samples, plus a preference-optimization defense

Input Manipulation Attack audio multimodal nlp

12 citations PDF
