Recent large language models (LLMs) have increasingly adopted the Mixture-of-Experts (MoE) architecture for efficiency. MoE-based LLMs heavily depend on a superficial safety mechanism in which harmful inputs are routed safety-critical experts. However, our analysis reveals that routing decisions for harmful inputs drift significantly after fine-tuning, exposing a critical vulnerability to harmful fine-tuning (HFT) attacks. Existing defenses, primarily designed for monolithic LLMs, are less effective for MoE LLMs as they fail to prevent drift in harmful input routing. To address this limitation, we propose SafeMoE, a safe fine-tuning method tailored to MoE LLMs. SafeMoE directly mitigates routing drift by penalizing the gap between the routing weights of a fine-tuned model and those of the initial safety-aligned model, thereby preserving the safety-aligned routing of harmful inputs to safety-critical experts. Experiments on open-source MoE LLMs ranging from 7B to 141B parameters demonstrate that SafeMoE effectively mitigates HFT attacks, reducing the harmfulness score of OLMoE from 62.0 to 5.0, for example, while maintaining task utility within 1% degradation and incurring only 2% overhead. It significantly outperforms state-of-the-art defense methods for safeguarding LLM fine-tuning and remains effective in recent large-scale MoE LLMs such as gpt-oss and Llama 4. Our implementation is available at https://anonymous.4open.science/r/SafeMoE.
Graph-based retrieval-augmented generation (Graph RAG) is increasingly deployed to support LLM applications by augmenting user queries with structured knowledge retrieved from a knowledge graph. While Graph RAG improves relational reasoning, it introduces a largely understudied threat: adversaries can reconstruct subgraphs from a target RAG system's knowledge graph, enabling privacy inference and replication of curated knowledge assets. We show that existing attacks are largely ineffective against Graph RAG even with simple prompt-based safeguards, because these attacks expose explicit exfiltration intent and are therefore easily suppressed by lightweight safe prompts. We identify three technical challenges for practical Graph RAG extraction under realistic safeguards and introduce GRASP, a closed-box, multi-turn subgraph reconstruction attack. GRASP (i) reframes extraction as a context-processing task, (ii) enforces format-compliant, instance-grounded outputs via per-record identifiers to reduce hallucinations and preserve relational details, and (iii) diversifies goal-driven attack queries using a momentum-aware scheduler to operate within strict query budgets. Across two real-world knowledge graphs, four safety-aligned LLMs, and multiple Graph RAG frameworks, GRASP attains the strongest type-faithful reconstruction where prior methods fail, reaching up to 82.9 F1. We further evaluate defenses and propose two lightweight mitigations that substantially reduce reconstruction fidelity without utility loss.