
MEUV: Achieving Fine-Grained Capability Activation in Large Language Models via Mutually Exclusive Unlock Vectors

Xin Tong¹, Zhi Lin², Jingya Wang¹, Meng Han³, Bo Jin⁴



Published on arXiv: 2509.12221

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves ≥87% attack success rate on Gemma-2-2B, LLaMA-3-8B, and Qwen-7B while reducing cross-topic capability leakage by up to 90% versus single-direction refusal editing baselines.

MEUV (Mutually Exclusive Unlock Vectors)

Novel technique introduced


Large language models (LLMs) enforce safety alignment to reliably refuse malicious requests, yet the same blanket safeguards also block legitimate uses in policing, defense, and other high-stakes settings. Earlier "refusal-direction" edits can bypass those safeguards, but they rely on a single vector that indiscriminately unlocks all hazardous topics, offering no semantic control. We introduce Mutually Exclusive Unlock Vectors (MEUV), a lightweight framework that factorizes the monolithic refusal direction into topic-aligned, nearly orthogonal vectors, each dedicated to one sensitive capability. MEUV is learned in a single epoch with a multi-task objective that blends a differential-ablation margin, cross-topic and orthogonality penalties, and several auxiliary terms. On bilingual malicious-prompt benchmarks, MEUV achieves an attack success rate of no less than 87% on Gemma-2-2B, LLaMA-3-8B, and Qwen-7B, yet cuts cross-topic leakage by up to 90% compared with the best single-direction baseline. Vectors trained in Chinese transfer almost unchanged to English (and vice versa), suggesting a language-agnostic refusal subspace. The results show that fine-grained, topic-level capability activation is achievable with minimal utility loss, paving the way for controlled LLM deployment in security-sensitive domains.
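The "refusal-direction" edit the abstract builds on can be illustrated with a small sketch: given a (hypothetical) hidden-state vector and a unit direction associated with refusal, the edit projects the refusal component out of the hidden state. MEUV's contribution is to use a separate, nearly orthogonal direction per sensitive topic rather than one global vector; the function below only shows the basic projection step, not the paper's training procedure.

```python
import numpy as np

def ablate_direction(hidden, v):
    """Project out the component of a hidden state along a unit
    'unlock' direction: h' = h - (v . h) v.
    In single-direction refusal editing, v is one global refusal
    direction; in MEUV, a topic-specific vector would be used so
    that only one capability is unlocked."""
    v = v / np.linalg.norm(v)              # ensure unit norm
    return hidden - np.dot(v, hidden) * v

# toy example: after ablation the state has no component along v
h = np.array([1.0, 2.0, 3.0])
v = np.array([0.0, 1.0, 0.0])
h_prime = ablate_direction(h, v)           # → [1.0, 0.0, 3.0]
```

The resulting state is orthogonal to the ablated direction, which is why a single global vector removes refusal behavior across all topics at once.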


Key Contributions

  • MEUV framework that factorizes the monolithic LLM refusal direction into topic-aligned, nearly orthogonal vectors enabling fine-grained per-topic safety bypass
  • Multi-task training objective combining differential-ablation margin, cross-topic penalty, and orthogonality regularization to achieve up to 90% reduction in cross-topic leakage
  • Empirical demonstration of language-agnostic refusal subspace: vectors trained in Chinese transfer to English (and vice versa) with minimal degradation across Gemma-2-2B, LLaMA-3-8B, and Qwen-7B
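The orthogonality regularization named above can be sketched as a penalty on the off-diagonal entries of the Gram matrix of the topic vectors. This is a plausible minimal form only; the paper's full objective also includes the differential-ablation margin, cross-topic penalty, and auxiliary terms, none of which are reproduced here.

```python
import numpy as np

def orthogonality_penalty(V):
    """Hypothetical sketch of an orthogonality regularizer over a
    stack of topic vectors V (one row per topic).  Rows are
    normalized, then squared off-diagonal entries of the Gram
    matrix V V^T are summed: the penalty is zero exactly when the
    per-topic unlock directions are mutually orthogonal."""
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    G = V @ V.T                       # cosine similarities between topics
    off_diag = G - np.diag(np.diag(G))
    return float(np.sum(off_diag ** 2))
```

Driving this term toward zero is what keeps the vectors "mutually exclusive": a vector trained for one topic then has little effect on the refusal behavior of the others, which is consistent with the reported reduction in cross-topic leakage.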

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time, targeted
Datasets
bilingual malicious-prompt benchmarks (Chinese and English)
Applications
llm safety alignment bypass, controlled llm deployment