Published on arXiv: 2508.17767

Model Inversion Attack

OWASP ML Top 10 — ML03

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

Analyzing LLM internal states before generation effectively detects and prevents copyrighted training data leakage, fitting smoothly into existing AI workflows.

ISACL

Novel technique introduced


Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) but pose risks of inadvertently exposing copyrighted or proprietary data, especially when such data is used for training but not intended for distribution. Traditional methods address these leaks only after content is generated, which can lead to the exposure of sensitive information. This study introduces a proactive approach: examining LLMs' internal states before text generation to detect potential leaks. By using a curated dataset of copyrighted materials, we trained a neural network classifier to identify risks, allowing for early intervention by stopping the generation process or altering outputs to prevent disclosure. Integrated with a Retrieval-Augmented Generation (RAG) system, this framework ensures adherence to copyright and licensing requirements while enhancing data privacy and ethical standards. Our results show that analyzing internal states effectively mitigates the risk of copyrighted data leakage, offering a scalable solution that fits smoothly into AI workflows, ensuring compliance with copyright regulations while maintaining high-quality text generation. The implementation is available on GitHub: https://github.com/changhu73/Internal_states_leakage


Key Contributions

  • Proactive internal-state analysis approach that examines LLM hidden states before text generation to flag impending copyrighted data leakage
  • Neural network classifier trained on curated copyrighted materials to identify leakage risk from LLM activations
  • Integration of ISACL into a RAG pipeline for copyright-aware generation with early-stopping or output alteration
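The gating idea in these contributions can be sketched as follows. This is a minimal illustration, not the paper's implementation: the classifier here is a stand-in logistic model with random weights, and all names (`hidden_dim`, `leak_risk`, `guarded_generate`) are hypothetical. In ISACL the classifier would be trained on activations elicited by curated copyrighted versus benign inputs, and the hidden states would come from the actual LLM before decoding begins.

```python
# Hedged sketch: pre-generation leak-risk gating on LLM hidden states.
# The trained classifier is mocked with random weights for illustration.
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 768  # typical transformer hidden size (assumption)

# Stand-in for trained classifier parameters; in practice these come from
# training on internal states produced by copyrighted vs. benign prompts.
W = rng.normal(size=hidden_dim)
b = 0.0

def leak_risk(hidden_states: np.ndarray) -> float:
    """Score pooled pre-generation hidden states; higher = riskier."""
    pooled = hidden_states.mean(axis=0)       # mean-pool over token positions
    logit = float(pooled @ W + b)
    return 1.0 / (1.0 + np.exp(-logit))       # sigmoid -> risk probability

def guarded_generate(hidden_states: np.ndarray, generate_fn, threshold=0.5):
    """Withhold (or redirect) generation when predicted leak risk is high."""
    if leak_risk(hidden_states) >= threshold:
        return "[generation withheld: potential copyrighted-content leak]"
    return generate_fn()

# Toy usage: random activations stand in for real LLM internal states
# with shape (num_tokens, hidden_dim).
states = rng.normal(size=(16, hidden_dim))
result = guarded_generate(states, lambda: "safe completion")
```

In a RAG deployment, the same check would run after retrieval and prompt assembly but before decoding, so a risky request can be stopped or rerouted without ever emitting protected text.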

🛡️ Threat Analysis

Model Inversion Attack

Defends against LLM memorization extraction — the scenario where an LLM reconstructs/reproduces copyrighted training data verbatim. Analyzing internal states to detect impending training-data reconstruction is a direct defense against the model inversion / training data extraction threat.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, white_box
Datasets
custom curated copyrighted materials dataset
Applications
llm text generation, rag systems, copyright compliance