
MEUV: Achieving Fine-Grained Capability Activation in Large Language Models via Mutually Exclusive Unlock Vectors

Xin Tong¹, Zhi Lin², Jingya Wang¹, Meng Han³, Bo Jin⁴



Published on arXiv: 2509.12221

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves ≥87% attack success rate on Gemma-2-2B, LLaMA-3-8B, and Qwen-7B while reducing cross-topic capability leakage by up to 90% versus single-direction refusal editing baselines.

MEUV (Mutually Exclusive Unlock Vectors)

Novel technique introduced


Large language models (LLMs) enforce safety alignment to reliably refuse malicious requests, yet the same blanket safeguards also block legitimate uses in policing, defense, and other high-stakes settings. Earlier "refusal-direction" edits can bypass those safeguards, but they rely on a single vector that indiscriminately unlocks all hazardous topics, offering no semantic control. We introduce Mutually Exclusive Unlock Vectors (MEUV), a lightweight framework that factorizes the monolithic refusal direction into topic-aligned, nearly orthogonal vectors, each dedicated to one sensitive capability. MEUV is learned in a single epoch with a multi-task objective that blends a differential-ablation margin, cross-topic and orthogonality penalties, and several auxiliary terms. On bilingual malicious-prompt benchmarks, MEUV achieves an attack success rate of no less than 87% on Gemma-2-2B, LLaMA-3-8B, and Qwen-7B, yet cuts cross-topic leakage by up to 90% compared with the best single-direction baseline. Vectors trained in Chinese transfer almost unchanged to English (and vice versa), suggesting a language-agnostic refusal subspace. The results show that fine-grained, topic-level capability activation is achievable with minimal utility loss, paving the way for controlled LLM deployment in security-sensitive domains.
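The "refusal-direction" edit the abstract builds on can be illustrated with a small sketch: given a (hypothetical) hidden-state vector and a unit direction associated with refusal, the edit projects the refusal component out of the hidden state. MEUV's contribution is to use a separate, nearly orthogonal direction per sensitive topic rather than one global vector; the function below only shows the basic projection step, not the paper's training procedure.

```python
import numpy as np

def ablate_direction(hidden, v):
    """Project out the component of a hidden state along a unit
    'unlock' direction: h' = h - (v . h) v.
    In single-direction refusal editing, v is one global refusal
    direction; in MEUV, a topic-specific vector would be used so
    that only one capability is unlocked."""
    v = v / np.linalg.norm(v)              # ensure unit norm
    return hidden - np.dot(v, hidden) * v

# toy example: after ablation the state has no component along v
h = np.array([1.0, 2.0, 3.0])
v = np.array([0.0, 1.0, 0.0])
h_prime = ablate_direction(h, v)           # → [1.0, 0.0, 3.0]
```

The resulting state is orthogonal to the ablated direction, which is why a single global vector removes refusal behavior across all topics at once.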


Key Contributions

  • MEUV framework that factorizes the monolithic LLM refusal direction into topic-aligned, nearly orthogonal vectors enabling fine-grained per-topic safety bypass
  • Multi-task training objective combining differential-ablation margin, cross-topic penalty, and orthogonality regularization to achieve up to 90% reduction in cross-topic leakage
  • Empirical demonstration of language-agnostic refusal subspace: vectors trained in Chinese transfer to English (and vice versa) with minimal degradation across Gemma-2-2B, LLaMA-3-8B, and Qwen-7B
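The orthogonality regularization named above can be sketched as a penalty on the off-diagonal entries of the Gram matrix of the topic vectors. This is a plausible minimal form only; the paper's full objective also includes the differential-ablation margin, cross-topic penalty, and auxiliary terms, none of which are reproduced here.

```python
import numpy as np

def orthogonality_penalty(V):
    """Hypothetical sketch of an orthogonality regularizer over a
    stack of topic vectors V (one row per topic).  Rows are
    normalized, then squared off-diagonal entries of the Gram
    matrix V V^T are summed: the penalty is zero exactly when the
    per-topic unlock directions are mutually orthogonal."""
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    G = V @ V.T                       # cosine similarities between topics
    off_diag = G - np.diag(np.diag(G))
    return float(np.sum(off_diag ** 2))
```

Driving this term toward zero is what keeps the vectors "mutually exclusive": a vector trained for one topic then has little effect on the refusal behavior of the others, which is consistent with the reported reduction in cross-topic leakage.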

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time, targeted
Datasets
bilingual malicious-prompt benchmarks (Chinese and English)
Applications
llm safety alignment bypass, controlled llm deployment