Latest papers

2 papers
attack arXiv Mar 4, 2026

Efficient Refusal Ablation in LLM through Optimal Transport

Geraldin Nanfack, Eugene Belilovsky, Elvis Dohmatob · Concordia University · Mila – Québec AI Institute

Optimal transport attack transforms harmful LLM activation distributions to match harmless ones, achieving 11% higher jailbreak success than refusal-direction ablation baselines.

Prompt Injection nlp
PDF
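The summary above describes transporting a harmful activation distribution so it matches a harmless one. As a toy illustration only (not the paper's method), the closed-form optimal transport map between two Gaussians can be applied to synthetic stand-in "activations"; every name, dimension, and distribution below is invented for the sketch, which assumes the activations are well approximated by Gaussians.

```python
import numpy as np

def sqrtm_psd(m):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(m)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def gaussian_ot_map(x_harmful, x_harmless):
    """Closed-form OT map between Gaussian fits of two sample sets.

    Transports the harmful samples so that their mean and covariance
    match the harmless set's (exact under the Gaussian assumption).
    """
    mu_x, mu_y = x_harmful.mean(0), x_harmless.mean(0)
    sx = np.cov(x_harmful, rowvar=False)
    sy = np.cov(x_harmless, rowvar=False)
    sx_half = sqrtm_psd(sx)
    sx_inv_half = np.linalg.inv(sx_half)
    # A = Sx^{-1/2} (Sx^{1/2} Sy Sx^{1/2})^{1/2} Sx^{-1/2}, so A Sx A = Sy.
    a = sx_inv_half @ sqrtm_psd(sx_half @ sy @ sx_half) @ sx_inv_half
    return (x_harmful - mu_x) @ a.T + mu_y

rng = np.random.default_rng(0)
# Hypothetical stand-ins for "refusal" vs. benign residual-stream activations.
harmful = rng.normal(2.0, 1.0, size=(500, 4))
harmless = 0.3 * rng.normal(0.0, 0.5, size=(500, 4)) @ rng.normal(size=(4, 4))
mapped = gaussian_ot_map(harmful, harmless)
```

After the map, `mapped` has exactly the harmless set's empirical mean and covariance, which is the Gaussian special case of distribution matching; the paper's actual attack and its 11% improvement over refusal-direction ablation are not reproduced here.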
attack arXiv Sep 26, 2025

Active Attacks: Red-teaming LLMs via Adaptive Environments

Taeyoung Yun, Pierre-Luc St-Charles, Jinkyoo Park et al. · KAIST · Mila – Québec AI Institute

RL-based red-teaming algorithm generates diverse LLM jailbreak prompts via adaptive victim fine-tuning, achieving 440x better coverage than GFlowNets.

Prompt Injection nlp
1 citation PDF Code
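The second summary's core idea, that a victim which adapts to each discovered attack forces the attacker toward diverse prompts, can be caricatured with a tiny bandit loop. This is a toy analogy with invented names and numbers, not the paper's RL pipeline or its GFlowNet baseline: the "victim" simply patches every prompt that succeeds, so a greedy attacker must keep finding new ones.

```python
import random

def red_team(n_prompts=20, steps=200, victim_adapts=True, seed=0):
    """Toy epsilon-greedy attacker against a (possibly) adaptive victim."""
    rng = random.Random(seed)
    q = [0.0] * n_prompts   # attacker's value estimate per candidate prompt
    patched = set()         # prompts the victim has learned to refuse
    found = set()           # unique successful attacks = coverage
    for _ in range(steps):
        if rng.random() < 0.1:                     # explore a random prompt
            a = rng.randrange(n_prompts)
        else:                                      # exploit the best-looking one
            a = max(range(n_prompts), key=q.__getitem__)
        success = a not in patched
        reward = 1.0 if success else -1.0          # failed attacks are penalized
        q[a] += 0.5 * (reward - q[a])
        if success:
            found.add(a)
            if victim_adapts:
                patched.add(a)                     # victim "fine-tunes" against it
    return len(found)

cov_adaptive = red_team(victim_adapts=True)
cov_static = red_team(victim_adapts=False)
```

With a static victim the greedy attacker keeps replaying its first success and coverage stalls; with an adaptive victim each success stops paying off, so the attacker sweeps the prompt space, which is the coverage effect the summary attributes to adaptive environments (the 440x figure is the paper's, not this toy's).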