Attack · 2026

Efficient Refusal Ablation in LLM through Optimal Transport

Geraldin Nanfack, Eugene Belilovsky, Elvis Dohmatob



Published on arXiv · 2603.04355

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves up to 11% higher attack success rates than state-of-the-art Refusal Feature Ablation across six models (7B–32B) while maintaining comparable perplexity.

PCA-Gaussian OT Refusal Ablation

Novel technique introduced


Safety-aligned language models refuse harmful requests through learned refusal behaviors encoded in their internal representations. Recent activation-based jailbreaking methods circumvent these safety mechanisms by applying orthogonal projections to remove refusal directions, but these approaches treat refusal as a one-dimensional phenomenon and ignore the rich distributional structure of model activations. We introduce a principled framework based on optimal transport theory that transforms the entire distribution of harmful activations to match harmless ones. By combining PCA with closed-form Gaussian optimal transport, we achieve efficient computation in high-dimensional representation spaces while preserving essential geometric structure. Across six models (Llama-2, Llama-3.1, Qwen-2.5; 7B-32B parameters), our method achieves up to 11% higher attack success rates than state-of-the-art baselines while maintaining comparable perplexity, demonstrating superior preservation of model capabilities. Critically, we discover that layer-selective intervention (applying optimal transport to 1-2 carefully chosen layers at approximately 40-60% network depth) substantially outperforms full-network interventions, revealing that refusal mechanisms may be localized rather than distributed. Our analysis provides new insights into the geometric structure of safety representations and suggests that current alignment methods may be vulnerable to distributional attacks beyond simple direction removal.
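The core construction in the abstract — projecting activations onto a PCA subspace and applying the closed-form optimal transport map between two Gaussians — can be sketched as follows. This is not the authors' released code; it is a minimal numpy illustration under stated assumptions (a shared PCA basis fit on pooled source/target activations, a hypothetical `gaussian_ot_map` helper name, and ridge regularization `eps` for numerical stability).

```python
import numpy as np

def _sqrtm_psd(M):
    """Square root of a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def gaussian_ot_map(X_src, X_tgt, k=8, eps=1e-6):
    """Fit a PCA-restricted, closed-form Gaussian OT map from source
    activations (e.g. harmful prompts) to target activations (harmless).
    Returns a callable that transports new activation vectors."""
    mu_s, mu_t = X_src.mean(axis=0), X_tgt.mean(axis=0)
    # Shared PCA basis from pooled, centered activations (illustrative choice).
    Z = np.vstack([X_src - mu_s, X_tgt - mu_t])
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    P = Vt[:k]                                  # (k, d) top-k principal axes
    Y_s = (X_src - mu_s) @ P.T
    Y_t = (X_tgt - mu_t) @ P.T
    C_s = np.cov(Y_s, rowvar=False) + eps * np.eye(k)
    C_t = np.cov(Y_t, rowvar=False) + eps * np.eye(k)
    # Closed-form Monge map between Gaussians:
    #   A = C_s^{-1/2} (C_s^{1/2} C_t C_s^{1/2})^{1/2} C_s^{-1/2}
    C_s_half = _sqrtm_psd(C_s)
    C_s_ihalf = np.linalg.inv(C_s_half)
    A = C_s_ihalf @ _sqrtm_psd(C_s_half @ C_t @ C_s_half) @ C_s_ihalf

    def transport(x):
        xc = x - mu_s
        y = xc @ P.T                            # subspace coordinates
        correction = (y @ A.T - y) @ P          # covariance edit, subspace only
        return mu_t + xc + correction           # shift mean, keep complement
    return transport
```

Restricting the transport to a k-dimensional PCA subspace is what keeps the matrix square roots cheap: the eigendecompositions run over k×k covariances rather than the full hidden dimension.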


Key Contributions

  • First application of Gaussian optimal transport to representation-level LLM jailbreaking, framing refusal bypass as a distributional matching problem rather than single-direction removal.
  • PCA-regularized transport that restricts computation to a low-dimensional subspace, enabling efficient high-dimensional activation transformation with improved attack success rates.
  • Empirical discovery that layer-selective OT intervention at 40–60% network depth substantially outperforms full-network intervention, suggesting refusal mechanisms are geometrically localized.
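The layer-selective finding above can be illustrated generically. The sketch below is not tied to any real model API: it treats a network as a plain list of callables and applies a transport function only after the layer nearest a chosen depth fraction (e.g. 0.5 for the paper's 40–60% range); the function name `run_with_intervention` is hypothetical.

```python
def run_with_intervention(layers, x, transport, target_frac=0.5):
    """Run an input through a stack of layer functions, applying
    `transport` to the activation after the single layer closest to
    `target_frac` of network depth (layer-selective intervention)."""
    target_idx = round(target_frac * (len(layers) - 1))
    for i, layer in enumerate(layers):
        x = layer(x)
        if i == target_idx:
            x = transport(x)  # distributional edit at one chosen layer
    return x
```

In a real attack, `transport` would be the fitted Gaussian OT map and `layers` the transformer blocks, hooked via the framework's usual activation-intervention mechanism.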

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time, targeted
Datasets
AdvBench, Llama-2-7B/13B, Llama-3.1-8B, Qwen-2.5-7B/14B/32B
Applications
large language model safety, llm safety alignment bypass