Latest papers

2 papers
attack arXiv Mar 4, 2026

Efficient Refusal Ablation in LLM through Optimal Transport

Geraldin Nanfack, Eugene Belilovsky, Elvis Dohmatob · Concordia University · Mila – Québec AI Institute

Optimal transport attack transforms harmful LLM activation distributions to match harmless ones, achieving 11% higher jailbreak success than refusal-direction ablation baselines.

Prompt Injection nlp
PDF
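The summary above describes transporting a harmful activation distribution so it matches a harmless one. As a toy illustration only (not the paper's method), the closed-form optimal transport map between two Gaussians can be applied to synthetic stand-in "activations"; every name, dimension, and distribution below is invented for the sketch, which assumes the activations are well approximated by Gaussians.

```python
import numpy as np

def sqrtm_psd(m):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(m)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def gaussian_ot_map(x_harmful, x_harmless):
    """Closed-form OT map between Gaussian fits of two sample sets.

    Transports the harmful samples so that their mean and covariance
    match the harmless set's (exact under the Gaussian assumption).
    """
    mu_x, mu_y = x_harmful.mean(0), x_harmless.mean(0)
    sx = np.cov(x_harmful, rowvar=False)
    sy = np.cov(x_harmless, rowvar=False)
    sx_half = sqrtm_psd(sx)
    sx_inv_half = np.linalg.inv(sx_half)
    # A = Sx^{-1/2} (Sx^{1/2} Sy Sx^{1/2})^{1/2} Sx^{-1/2}, so A Sx A = Sy.
    a = sx_inv_half @ sqrtm_psd(sx_half @ sy @ sx_half) @ sx_inv_half
    return (x_harmful - mu_x) @ a.T + mu_y

rng = np.random.default_rng(0)
# Hypothetical stand-ins for "refusal" vs. benign residual-stream activations.
harmful = rng.normal(2.0, 1.0, size=(500, 4))
harmless = 0.3 * rng.normal(0.0, 0.5, size=(500, 4)) @ rng.normal(size=(4, 4))
mapped = gaussian_ot_map(harmful, harmless)
```

After the map, `mapped` has exactly the harmless set's empirical mean and covariance, which is the Gaussian special case of distribution matching; the paper's actual attack and its 11% improvement over refusal-direction ablation are not reproduced here.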
attack arXiv Sep 26, 2025

Active Attacks: Red-teaming LLMs via Adaptive Environments

Taeyoung Yun, Pierre-Luc St-Charles, Jinkyoo Park et al. · KAIST · Mila – Québec AI Institute

RL-based red-teaming algorithm generates diverse LLM jailbreak prompts via adaptive victim fine-tuning, achieving 440x better coverage than GFlowNets.

Prompt Injection nlp
1 citation PDF Code
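The second summary's core idea, that a victim which adapts to each discovered attack forces the attacker toward diverse prompts, can be caricatured with a tiny bandit loop. This is a toy analogy with invented names and numbers, not the paper's RL pipeline or its GFlowNet baseline: the "victim" simply patches every prompt that succeeds, so a greedy attacker must keep finding new ones.

```python
import random

def red_team(n_prompts=20, steps=200, victim_adapts=True, seed=0):
    """Toy epsilon-greedy attacker against a (possibly) adaptive victim."""
    rng = random.Random(seed)
    q = [0.0] * n_prompts   # attacker's value estimate per candidate prompt
    patched = set()         # prompts the victim has learned to refuse
    found = set()           # unique successful attacks = coverage
    for _ in range(steps):
        if rng.random() < 0.1:                     # explore a random prompt
            a = rng.randrange(n_prompts)
        else:                                      # exploit the best-looking one
            a = max(range(n_prompts), key=q.__getitem__)
        success = a not in patched
        reward = 1.0 if success else -1.0          # failed attacks are penalized
        q[a] += 0.5 * (reward - q[a])
        if success:
            found.add(a)
            if victim_adapts:
                patched.add(a)                     # victim "fine-tunes" against it
    return len(found)

cov_adaptive = red_team(victim_adapts=True)
cov_static = red_team(victim_adapts=False)
```

With a static victim the greedy attacker keeps replaying its first success and coverage stalls; with an adaptive victim each success stops paying off, so the attacker sweeps the prompt space, which is the coverage effect the summary attributes to adaptive environments (the 440x figure is the paper's, not this toy's).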