Attack · 2025

Language Models are Injective and Hence Invertible

Giorgos Nikolaou 1,2, Tommaso Mencattini 1,3, Donato Crisostomi 4, Andrea Santilli 4,3, Yannis Panagakis 5,2, Emanuele Rodolà 3

15 citations · 3 influential · 32 references · arXiv

Published on arXiv · 2510.15511

Model Inversion Attack

OWASP ML Top 10 — ML03

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

SipIt provably and exactly reconstructs input text from hidden activations in linear time, demonstrating that any intermediate LLM representation fully and efficiently leaks its corresponding input.

SipIt

Novel technique introduced


Transformer components such as non-linear activations and normalization are inherently non-injective, suggesting that different inputs could map to the same output and prevent exact recovery of the input from a model's representations. In this paper, we challenge this view. First, we prove mathematically that transformer language models mapping discrete input sequences to their corresponding sequence of continuous representations are injective and therefore lossless, a property established at initialization and preserved during training. Second, we confirm this result empirically through billions of collision tests on six state-of-the-art language models, and observe no collisions. Third, we operationalize injectivity: we introduce SipIt, the first algorithm that provably and efficiently reconstructs the exact input text from hidden activations, establishing linear-time guarantees and demonstrating exact invertibility in practice. Overall, our work establishes injectivity as a fundamental and exploitable property of language models, with direct implications for transparency, interpretability, and safe deployment.
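The abstract's empirical claim (billions of collision tests, zero collisions) can be illustrated with a minimal, stdlib-only sketch. Everything here is ours, not the authors': `represent` is a toy stand-in for a transformer's final hidden state, made injective by construction (a base-`V` positional encoding), so the check is expected to find no collisions; the paper's actual tests run real language models over real prompt pairs.

```python
# Toy collision test, illustrating the paper's empirical check.
# `represent` stands in for "hidden representation of a sequence" and is
# injective by construction, so zero collisions are expected.

import itertools

V = 20  # toy vocabulary size


def represent(tokens):
    """Stand-in for a model's final hidden representation of a sequence."""
    h = 0
    for t in tokens:
        h = h * V + t + 1  # injective: distinct sequences -> distinct h
    return h


seen = {}
collisions = 0
for seq in itertools.product(range(V), repeat=3):  # 8,000 distinct inputs
    r = represent(seq)
    if r in seen and seen[r] != seq:
        collisions += 1
    seen[r] = seq

assert collisions == 0  # no two distinct inputs share a representation
```

The dictionary-based check mirrors how one would scan for collisions at scale: store each representation once and flag any repeat that comes from a different input.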


Key Contributions

  • Mathematical proof that transformer language models are injective (lossless) from initialization through training
  • Empirical validation via billions of collision tests across six state-of-the-art LLMs with zero collisions observed
  • SipIt: the first algorithm with provable linear-time guarantees for exact reconstruction of input text from hidden activations
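The reconstruction idea behind SipIt can be sketched in a few lines, under loud assumptions: the toy `hidden_states` function below replaces a real transformer forward pass (and is injective by construction), and the names are ours, not the authors'. The sketch shows the greedy strategy: with white-box access to per-position hidden states, recover the input left to right by replaying vocabulary candidates through the model and keeping the unique one that reproduces the observed state.

```python
# Hedged sketch of SipIt-style greedy inversion (illustrative only; the
# toy `hidden_states` stands in for a real transformer forward pass).

V = 100  # toy vocabulary of token ids 0..V-1


def hidden_states(tokens):
    """One 'hidden state' per position; each depends on the whole prefix."""
    states, h = [], 0
    for t in tokens:
        h = h * V + t + 1  # injective prefix encoding
        states.append(h)
    return states


def invert(target_states):
    """Recover tokens left to right: at each position, keep the unique
    vocabulary candidate whose forward pass matches the observed state."""
    recovered = []
    for i, target in enumerate(target_states):
        for cand in range(V):
            if hidden_states(recovered + [cand])[i] == target:
                recovered.append(cand)
                break
        else:
            raise ValueError(f"no vocabulary token matches position {i}")
    return recovered


secret = [42, 7, 99, 3]
assert invert(hidden_states(secret)) == secret  # exact reconstruction
```

This naive sketch re-runs the forward pass from scratch for every candidate; a practical implementation would cache the already-recovered prefix, which is consistent with the linear-time guarantee the paper claims for SipIt.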

🛡️ Threat Analysis

Model Inversion Attack

SipIt performs exact embedding inversion: it recovers private input text from a model's hidden activations. Any sensitive content fed into an LLM can therefore be reconstructed by an adversary with access to intermediate representations, a concrete data-reconstruction threat of exactly the kind OWASP ML03 (Model Inversion Attack) describes.


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
white_box · inference_time
Applications
language model inference · embedding/activation privacy