
Discovering Universal Activation Directions for PII Leakage in Language Models

Leo Marchyok 1, Zachary Coalson 1, Sungho Keum 2, Sooel Son 2, Sanghyun Hong 1

0 citations · 36 references · arXiv (Cornell University)


Published on arXiv · 2602.16980

Model Inversion Attack

OWASP ML Top 10 — ML03

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

Steering along universal activation directions substantially increases PII leakage compared to existing prompt-based extraction methods while minimally affecting generation quality.

UniLeak

Novel technique introduced


Modern language models exhibit rich internal structure, yet little is known about how privacy-sensitive behaviors, such as personally identifiable information (PII) leakage, are represented and modulated within their hidden states. We present UniLeak, a mechanistic-interpretability framework that identifies universal activation directions: latent directions in a model's residual stream whose linear addition at inference time consistently increases the likelihood of generating PII across prompts. These model-specific directions generalize across contexts and amplify PII generation probability with minimal impact on generation quality. UniLeak recovers such directions without access to training data or ground-truth PII, relying only on self-generated text. Across multiple models and datasets, steering along these universal directions substantially increases PII leakage compared to existing prompt-based extraction methods. Our results offer a new perspective on PII leakage: the superposition of a latent signal in the model's representations, enabling both risk amplification and mitigation.
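The core mechanism the abstract describes, linearly adding a fixed direction to the residual stream at inference time, can be sketched with a PyTorch forward hook. This is a minimal illustration only: the tiny linear `block` stands in for a transformer layer, and `direction` and `alpha` are hypothetical placeholders, not directions or strengths recovered by UniLeak.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 8
block = nn.Linear(d_model, d_model)  # stand-in for one residual-stream layer

# Hypothetical unit-norm steering direction and steering strength.
direction = torch.randn(d_model)
direction = direction / direction.norm()
alpha = 4.0

def steering_hook(module, inputs, output):
    # Linear addition of the direction to every activation in the batch.
    return output + alpha * direction

handle = block.register_forward_hook(steering_hook)

x = torch.randn(2, d_model)
steered = block(x)   # activations with the direction added
handle.remove()
plain = block(x)     # unmodified activations for comparison

# Steered and plain activations differ by exactly alpha * direction.
delta = steered - plain
```

In a real attack setting the hook would be registered on a chosen transformer block of the target model during generation, shifting every token's hidden state along the recovered direction while leaving the rest of the forward pass untouched.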


Key Contributions

  • UniLeak framework that discovers universal activation directions in the residual stream whose linear addition at inference time reliably amplifies PII generation probability
  • Attack requires no access to training data or ground-truth PII — relies only on self-generated text to recover these directions
  • Demonstrates that steering along these directions outperforms existing prompt-based PII extraction methods across multiple LLMs and datasets

🛡️ Threat Analysis

Model Inversion Attack

UniLeak extracts private training data (PII) from language models by manipulating internal representations — a model inversion/memorization extraction attack. The adversary reconstructs memorized PII using self-generated text and activation steering without needing ground-truth training data.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time, targeted
Applications
language model privacy, pii extraction, mechanistic interpretability