Defense · 2026

Mitigating Gradient Inversion Risks in Language Models via Token Obfuscation

Xinguo Feng, Zhongkui Ma, Zihan Wang, Alsharif Abuadbba, Guangdong Bai

0 citations · 62 references · arXiv (Cornell University)


Published on arXiv · 2602.15897

Model Inversion Attack (OWASP ML Top 10: ML03)

Sensitive Information Disclosure (OWASP LLM Top 10: LLM06)

Key Finding

GHOST reduces adversary token recovery rate to under 1% against state-of-the-art GIAs, protecting up to 98% more tokens than the best baseline defense while maintaining competitive model utility.

GHOST (Gradient Shield with Obfuscated Tokens)

Novel technique introduced


Training and fine-tuning large-scale language models benefit greatly from collaborative learning, but the approach has proven vulnerable to gradient inversion attacks (GIAs), which allow adversaries to reconstruct private training data from shared gradients. Existing defenses mainly employ gradient perturbation techniques, e.g., noise injection or gradient pruning, to disrupt a GIA's direct mapping from gradient space to token space. However, these methods often fall short because semantic similarity is retained across the gradient, embedding, and token spaces. In this work, we propose GHOST (Gradient Shield with Obfuscated Tokens), a token-level obfuscation defense that neutralizes GIAs by decoupling the inherent connections across the gradient, embedding, and token spaces. GHOST is built on a key insight: because the token space is large, there exist semantically distinct yet embedding-proximate tokens that can serve as shadow substitutes for the original tokens, enabling a semantic disconnection in the token space while preserving the connections in the embedding and gradient spaces. GHOST comprises a searching step, which identifies semantically distinct candidate tokens via a multi-criteria search, and a selection step, which chooses optimal shadow tokens that minimize disruption to features critical for training by preserving alignment with the internal outputs produced by the original tokens. Evaluation across diverse model architectures (from BERT to Llama) and datasets demonstrates GHOST's effectiveness in protecting privacy (as low as a 1% recovery rate) and preserving utility (up to 0.92 classification F1 and as low as 5.45 perplexity), on both classification and generation tasks, against state-of-the-art GIAs and adaptive attack scenarios.
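The searching step described above can be sketched in miniature: rank the vocabulary by embedding proximity to the original token, then filter out semantically related candidates. This is a toy numpy illustration, not the paper's implementation; the `semantic_neighbors` lookup is a hypothetical stand-in for GHOST's multi-criteria filter, and the selection step (activation-alignment beam search) is omitted.

```python
import numpy as np

def find_shadow_candidates(emb, token_id, semantic_neighbors, k=5):
    """Search step (sketch): rank all vocabulary tokens by cosine
    similarity to the original token's embedding, then drop the token
    itself and any semantically related tokens. `semantic_neighbors`
    maps a token id to a set of related ids (hypothetical proxy for
    the paper's multi-criteria semantic filter)."""
    v = emb[token_id]
    # cosine similarity of every vocabulary embedding to the original
    sims = emb @ v / (np.linalg.norm(emb, axis=1) * np.linalg.norm(v) + 1e-9)
    order = np.argsort(-sims)  # most embedding-proximate first
    cands = [t for t in order
             if t != token_id and t not in semantic_neighbors.get(token_id, set())]
    return cands[:k]

# Tiny 4-token vocabulary: token 1 is a near-synonym of token 0 and is
# filtered out; token 3 is embedding-proximate but semantically distinct.
emb = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.98, 0.2]])
shadows = find_shadow_candidates(emb, 0, {0: {1}}, k=2)
```

Tokens 3 and 2 survive the filter here: both are semantically unrelated to token 0, and token 3 is the closer of the two in embedding space, so it ranks first.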


Key Contributions

  • GHOST: a token-level obfuscation defense that replaces original tokens with semantically distinct but embedding-proximate 'shadow tokens', decoupling the gradient-embedding-token space alignment that GIAs exploit
  • A two-step search-and-select mechanism: multi-criteria neighbor filtering to identify candidate shadow tokens, followed by beam search selecting substitutes that minimize internal activation discrepancy across layers
  • Comprehensive evaluation across 6 model families (21 models from BERT to Llama), 6 datasets, and adaptive adversary scenarios, achieving ≤1% token recovery rate while preserving utility (0.92 F1, 5.45 perplexity)

🛡️ Threat Analysis

Model Inversion Attack

Directly defends against gradient inversion attacks (GIAs) — the canonical gradient leakage/reconstruction attack in collaborative/federated learning where an adversary reconstructs private training data from shared gradients. GHOST is explicitly designed to prevent this data reconstruction by decoupling gradient, embedding, and token spaces.
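To see why shared gradients leak tokens at all, consider the embedding table of a language model: only rows belonging to tokens that actually appear in a batch receive a nonzero gradient. The toy numpy sketch below illustrates this leakage channel with a deliberately simplified per-token loss; it is an assumption-laden stand-in, not a full GIA reconstruction.

```python
import numpy as np

# Toy leakage demo: with a trainable embedding table, only the rows of
# tokens present in the batch get nonzero gradients, so anyone observing
# the gradient can read off the set of token ids used.
vocab, dim = 10, 4
rng = np.random.default_rng(0)
emb = rng.normal(size=(vocab, dim))   # embedding table (trainable)
w = rng.normal(size=dim)              # stand-in downstream weights

tokens = [2, 5, 5, 7]                 # "private" input sequence
grad = np.zeros_like(emb)
for t in tokens:
    # simplified per-position loss = w . emb[t], so d(loss)/d(emb[t]) = w
    grad[t] += w

# Nonzero gradient rows expose exactly which tokens occurred.
leaked = sorted(np.flatnonzero(np.abs(grad).sum(axis=1) > 0))
```

Here `leaked` recovers the token set {2, 5, 7} from the gradient alone. Real GIAs go further, recovering token order and content from deeper-layer gradients, which is the mapping GHOST's shadow tokens are designed to sever.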


Details

Domains
nlp, federated-learning
Model Types
llm, transformer
Threat Tags
white_box, training_time
Datasets
6 diverse NLP datasets (unnamed in available excerpt)
Applications
collaborative llm fine-tuning, text classification, text generation, federated learning