Published on arXiv

2509.14608

Model Inversion Attack

OWASP ML Top 10 — ML03

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Existing probabilistic defenses fundamentally fail against data exfiltration in enterprise LLMs; only deterministic access control enforced at every stage of fine-tuning and RAG inference can provably prevent sensitive data leakage to unauthorized users.

Participant-Aware Access Control

Novel technique introduced


Large language models (LLMs) are increasingly deployed in enterprise settings where they interact with multiple users and are trained or fine-tuned on sensitive internal data. While fine-tuning enhances performance by internalizing domain knowledge, it also introduces a critical security risk: leakage of confidential training data to unauthorized users. These risks are exacerbated when LLMs are combined with Retrieval-Augmented Generation (RAG) pipelines that dynamically fetch contextual documents at inference time. We demonstrate data exfiltration attacks on AI assistants where adversaries can exploit current fine-tuning and RAG architectures to leak sensitive information by leveraging the lack of access control enforcement. We show that existing defenses, including prompt sanitization, output filtering, system isolation, and training-level privacy mechanisms, are fundamentally probabilistic and fail to offer robust protection against such attacks. We take the position that only a deterministic and rigorous enforcement of fine-grained access control during both fine-tuning and RAG-based inference can reliably prevent the leakage of sensitive data to unauthorized recipients. We introduce a framework centered on the principle that any content used in training, retrieval, or generation by an LLM is explicitly authorized for *all users involved in the interaction*. Our approach offers a simple yet powerful paradigm shift for building secure multi-user LLM systems that are grounded in classical access control but adapted to the unique challenges of modern AI workflows. Our solution has been deployed in Microsoft Copilot Tuning, a product offering that enables organizations to fine-tune models using their own enterprise-specific data.
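The retrieval-time half of this principle can be sketched in a few lines. The following is a minimal, hypothetical illustration (the `Document` class, ACL representation, and `filter_retrieved` helper are assumptions, not the paper's actual implementation): a retrieved document may enter the model's context only if *every* participant in the interaction is authorized to read it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    authorized_users: frozenset  # ACL attached to the document at ingestion

def filter_retrieved(docs, participants):
    """Keep only documents whose ACL covers *all* interaction participants.

    This is a deterministic check, not a probabilistic defense: a document
    is dropped from the context unless each participant already has read
    access, so generation can never surface content to an unauthorized user.
    """
    participants = frozenset(participants)
    return [d for d in docs if participants <= d.authorized_users]

# A document readable only by alice is excluded from a chat that also
# includes bob, regardless of how relevant the retriever scored it.
docs = [
    Document("d1", "public roadmap", frozenset({"alice", "bob"})),
    Document("d2", "alice's salary letter", frozenset({"alice"})),
]
print([d.doc_id for d in filter_retrieved(docs, {"alice", "bob"})])  # ['d1']
```

Note the subset test `participants <= d.authorized_users`: the authorization predicate is intersection over all participants, so adding any user to the conversation can only shrink, never grow, the retrievable set.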


Key Contributions

  • Demonstrates that fine-tuned enterprise LLMs are vulnerable to training data extraction attacks and that existing probabilistic defenses (sanitization, output filtering, DP) fail to provide robust protection
  • Demonstrates indirect cross-prompt injection attacks (XPIA) on RAG pipelines that succeed even under human review, exfiltrating sensitive documents via steganographic encoding in model responses
  • Proposes a deterministic participant-aware access control framework — enforcing that any content used in training or retrieval is authorized for all users involved — deployed in Microsoft Copilot Tuning
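The training-time half of the framework can be sketched the same way. This is a hedged illustration, not the Copilot Tuning implementation: one simple way to enforce the invariant at fine-tuning time is to partition the corpus by its exact ACL, fine-tune one model variant per partition, and serve a variant only to users in that partition's ACL (the `partition_by_acl` and `may_serve` helpers are assumed names).

```python
from collections import defaultdict

def partition_by_acl(corpus):
    """Group a corpus of (text, acl) pairs by their exact ACL.

    Each group is a candidate fine-tuning set; a model variant trained on
    a group has, by construction, a well-defined set of users authorized
    to see every document it internalized.
    """
    groups = defaultdict(list)
    for text, acl in corpus:
        groups[frozenset(acl)].append(text)
    return dict(groups)

def may_serve(model_acl, user):
    """Deterministic gate: a user may query a fine-tuned variant only if
    they are authorized for all of its training data."""
    return user in model_acl

corpus = [
    ("hr policy draft", frozenset({"u1"})),
    ("eng design doc", frozenset({"u1", "u2"})),
    ("hr comp bands", frozenset({"u1"})),
]
groups = partition_by_acl(corpus)
```

With this partitioning, a model inversion attack by `u2` against the HR variant is moot: `u2` is never routed to a model whose weights encode documents they cannot read.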

🛡️ Threat Analysis

Model Inversion Attack

The paper demonstrates attacks in which adversaries extract sensitive training data from fine-tuned enterprise LLMs, showing that existing probabilistic defenses (differential privacy, output filtering) fail to prevent reconstruction of confidential training content.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, training_time, inference_time, targeted
Applications
enterprise ai assistants, rag pipelines, fine-tuned llm systems, microsoft copilot