defense 2025

Adaptive Backtracking for Privacy Protection in Large Language Models

Zhihao Yao 1, Yuxuan Gu 2, Xiachong Feng 2, Weitao Ma 2, Bo Li 1, Xiaocheng Feng 2

0 citations


Published on arXiv

2508.06087

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

ABack improves the overall privacy-utility score by up to 15% over strong baselines without the performance degradation caused by data sanitization methods.

ABack

Novel technique introduced


Privacy preservation has emerged as a critical topic in the era of artificial intelligence. However, current work focuses on user-oriented privacy and overlooks the severe enterprise data-leakage risks exacerbated by the Retrieval-Augmented Generation (RAG) paradigm. To address this gap, our paper introduces a new objective: enterprise-oriented privacy. Achieving it requires overcoming two fundamental challenges: existing methods such as data sanitization severely degrade model performance, and the field lacks public datasets for evaluation. We address these challenges as follows. (1) To prevent performance degradation, we propose ABack, a training-free mechanism that leverages a Hidden State Model to pinpoint the origin of a leakage intention and rewrite the output safely. (2) To remedy the lack of datasets, we construct PriGenQA, a new benchmark for enterprise privacy scenarios in healthcare and finance. To ensure a rigorous evaluation, we move beyond simple static attacks by developing a powerful adaptive attacker trained with Group Relative Policy Optimization (GRPO). Experiments show that against this stronger adversary, ABack improves the overall privacy-utility score by up to 15% over strong baselines while avoiding the performance trade-offs of prior methods.
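The abstract describes ABack only at a high level: detect a leakage intention, trace it back to where it began, and rewrite from there. The toy sketch below illustrates that general backtracking idea under our own assumptions and is not the paper's implementation. The hypothetical `toy_leak_score` stands in for the Hidden State Model (it is a keyword heuristic over tokens, not a probe over hidden states), the hypothetical `toy_propose` stands in for the LLM's next-token choice, and the two thresholds `detect_at` / `origin_at` plus the ban-and-resample rewrite are illustrative choices.

```python
from typing import Callable, Dict, List, Set

Tokens = List[str]

def backtrack_generate(
    propose: Callable[[Tokens, Set[str]], str],
    leak_score: Callable[[Tokens], float],
    max_len: int = 12,
    detect_at: float = 0.8,   # hypothetical: score that triggers a backtrack
    origin_at: float = 0.5,   # hypothetical: lower score marking where the intention began
    max_backtracks: int = 3,
) -> Tokens:
    """Toy backtracking decoder guarded by a leakage scorer (illustrative only)."""
    out: Tokens = []
    banned: Dict[int, Set[str]] = {}  # position -> tokens rejected at that position
    backtracks = 0
    while len(out) < max_len:
        tok = propose(out, banned.get(len(out), set()))
        if tok == "<eos>":
            break
        out.append(tok)
        if leak_score(out) >= detect_at and backtracks < max_backtracks:
            # Earliest prefix whose score already reached origin_at; since
            # origin_at <= detect_at, the full prefix always qualifies.
            origin = next(i for i in range(1, len(out) + 1)
                          if leak_score(out[:i]) >= origin_at)
            pos = origin - 1  # index of the token where the intention began
            # Drop bans for positions we are about to discard, then ban the origin token.
            banned = {p: s for p, s in banned.items() if p <= pos}
            banned.setdefault(pos, set()).add(out[pos])
            out = out[:pos]  # rewind to just before the origin and resample
            backtracks += 1
    return out

# Hypothetical toy "model": ordered next-token preferences keyed by the last token.
NEXT = {
    "<bos>": ["The"],
    "The": ["patient's"],
    "patient's": ["SSN", "record"],
    "SSN": ["is"],
    "is": ["123-45-6789"],
    "123-45-6789": ["<eos>"],
    "record": ["stays"],
    "stays": ["confidential"],
    "confidential": ["<eos>"],
}

def toy_propose(prefix: Tokens, banned: Set[str]) -> str:
    last = prefix[-1] if prefix else "<bos>"
    for tok in NEXT[last]:
        if tok not in banned:
            return tok
    return "<eos>"

def toy_leak_score(prefix: Tokens) -> float:
    if any(t[0].isdigit() for t in prefix):
        return 0.9  # the secret itself has been emitted
    if "SSN" in prefix:
        return 0.6  # intention to disclose, before the secret appears
    return 0.0

print(" ".join(backtrack_generate(toy_propose, toy_leak_score)))
# prints: The patient's record stays confidential
```

In this toy run the leak is only detected once the number appears, but the decoder rewinds to the earlier "SSN" token where the intention began, so the rewrite diverges before the disclosure path rather than just deleting the secret.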


Key Contributions

  • ABack: a training-free adaptive backtracking mechanism using a Hidden State Model to detect and suppress privacy leakage intention at its origin during generation
  • PriGenQA: a new benchmark covering enterprise privacy scenarios in healthcare and finance domains
  • An adaptive GRPO-trained attacker that provides a stronger evaluation baseline than static attacks
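GRPO's core step scores each sampled output relative to the other outputs sampled for the same prompt, replacing a learned value baseline with group statistics: each reward is centered on the group mean and divided by the group standard deviation. The sketch below shows only that advantage computation. The attacker reward `leak_reward` (1 if the defended model's response contains a protected string, else 0), the example responses, and the use of the population standard deviation plus a small `eps` are our own assumptions, not details taken from the paper.

```python
import statistics
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std; eps guards std == 0
    return [(r - mean) / (std + eps) for r in rewards]

def leak_reward(response: str, secret: str) -> float:
    # Hypothetical attacker reward: 1 if the protected string leaked, else 0.
    return 1.0 if secret in response else 0.0

# Hypothetical group of target-model responses to one attacker prompt.
responses = [
    "The account number is 4417-0092.",
    "I can't share account details.",
    "Please contact your branch.",
    "Sure: 4417-0092.",
]
rewards = [leak_reward(r, "4417-0092") for r in responses]
advs = group_relative_advantages(rewards)
print([round(a, 3) for a in advs])  # prints: [1.0, -1.0, -1.0, 1.0]
```

Attacker prompts whose sampled responses leak receive positive advantages and are reinforced; when every response in a group gets the same reward, all advantages are zero and that group contributes no learning signal.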

Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time
Datasets
PriGenQA
Applications
rag systems, enterprise knowledge bases, question answering