Published on arXiv: 2508.17767

Model Inversion Attack

OWASP ML Top 10 — ML03

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

Analyzing LLM internal states before generation effectively detects and prevents copyrighted training data leakage, fitting smoothly into existing AI workflows.

ISACL

Novel technique introduced


Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) but pose risks of inadvertently exposing copyrighted or proprietary data, especially when such data is used for training but not intended for distribution. Traditional methods address these leaks only after content is generated, which can lead to the exposure of sensitive information. This study introduces a proactive approach: examining LLMs' internal states before text generation to detect potential leaks. By using a curated dataset of copyrighted materials, we trained a neural network classifier to identify risks, allowing for early intervention by stopping the generation process or altering outputs to prevent disclosure. Integrated with a Retrieval-Augmented Generation (RAG) system, this framework ensures adherence to copyright and licensing requirements while enhancing data privacy and ethical standards. Our results show that analyzing internal states effectively mitigates the risk of copyrighted data leakage, offering a scalable solution that fits smoothly into AI workflows, ensuring compliance with copyright regulations while maintaining high-quality text generation. The implementation is available on GitHub: https://github.com/changhu73/Internal_states_leakage


Key Contributions

  • Proactive internal-state analysis approach that examines LLM hidden states before text generation to flag impending copyrighted data leakage
  • Neural network classifier trained on curated copyrighted materials to identify leakage risk from LLM activations
  • Integration of ISACL into a RAG pipeline for copyright-aware generation with early-stopping or output alteration
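The gating idea in these contributions can be sketched as follows. This is a minimal illustration, not the paper's implementation: the classifier here is a stand-in logistic model with random weights, and all names (`hidden_dim`, `leak_risk`, `guarded_generate`) are hypothetical. In ISACL the classifier would be trained on activations elicited by curated copyrighted versus benign inputs, and the hidden states would come from the actual LLM before decoding begins.

```python
# Hedged sketch: pre-generation leak-risk gating on LLM hidden states.
# The trained classifier is mocked with random weights for illustration.
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 768  # typical transformer hidden size (assumption)

# Stand-in for trained classifier parameters; in practice these come from
# training on internal states produced by copyrighted vs. benign prompts.
W = rng.normal(size=hidden_dim)
b = 0.0

def leak_risk(hidden_states: np.ndarray) -> float:
    """Score pooled pre-generation hidden states; higher = riskier."""
    pooled = hidden_states.mean(axis=0)       # mean-pool over token positions
    logit = float(pooled @ W + b)
    return 1.0 / (1.0 + np.exp(-logit))       # sigmoid -> risk probability

def guarded_generate(hidden_states: np.ndarray, generate_fn, threshold=0.5):
    """Withhold (or redirect) generation when predicted leak risk is high."""
    if leak_risk(hidden_states) >= threshold:
        return "[generation withheld: potential copyrighted-content leak]"
    return generate_fn()

# Toy usage: random activations stand in for real LLM internal states
# with shape (num_tokens, hidden_dim).
states = rng.normal(size=(16, hidden_dim))
result = guarded_generate(states, lambda: "safe completion")
```

In a RAG deployment, the same check would run after retrieval and prompt assembly but before decoding, so a risky request can be stopped or rerouted without ever emitting protected text.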

🛡️ Threat Analysis

Model Inversion Attack

Defends against LLM memorization extraction — the scenario where an LLM reconstructs/reproduces copyrighted training data verbatim. Analyzing internal states to detect impending training-data reconstruction is a direct defense against the model inversion / training data extraction threat.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, white_box
Datasets
custom curated copyrighted materials dataset
Applications
llm text generation, rag systems, copyright compliance