defense arXiv Feb 7, 2026
Yuhao Wang, Shengfang Zhai, Guanghao Jin et al. · National University of Singapore · Southern University of Science and Technology +1 more
Defends LLM agent memory from adversarial data extraction by injecting optimized honeypot documents with SPRT-based sequential attacker detection
Sensitive Information Disclosure nlp
Large Language Model (LLM)-based agents employ external and internal memory systems to handle complex, goal-oriented tasks, yet these memory systems expose them to severe extraction attacks for which effective defenses are still lacking. In this paper, we propose MemPot, the first theoretically verified defense framework against memory extraction attacks, which works by injecting optimized honeypots into the memory. Through a two-stage optimization process, MemPot generates trap documents that maximize the retrieval probability for attackers while remaining inconspicuous to benign users. We model the detection process as Wald's Sequential Probability Ratio Test (SPRT) and theoretically prove that MemPot requires fewer sampling rounds on average than optimal static detectors. Empirically, MemPot significantly outperforms state-of-the-art baselines, achieving a 50% improvement in detection AUROC and an 80% increase in True Positive Rate under low False Positive Rate constraints. Furthermore, our experiments confirm that MemPot incurs zero additional online inference latency and preserves the agent's utility on standard tasks, verifying its superiority in safety, harmlessness, and efficiency.
llm National University of Singapore · Southern University of Science and Technology · Tsinghua University
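The abstract above models attacker detection as Wald's SPRT. The following is a minimal sketch of a generic Bernoulli SPRT, not MemPot's actual detector: it assumes each query yields a binary observation (1 if a honeypot document was retrieved, 0 otherwise), and the hit probabilities `p0` (benign) and `p1` (attacker) as well as the function names are hypothetical values chosen for illustration.

```python
import math


def sprt_thresholds(alpha, beta):
    """Wald's approximate log-likelihood-ratio thresholds.

    alpha: target false positive rate, beta: target false negative rate.
    """
    upper = math.log((1 - beta) / alpha)   # cross this -> accept H1 (attacker)
    lower = math.log(beta / (1 - alpha))   # cross this -> accept H0 (benign)
    return lower, upper


def sprt_decide(hits, p0, p1, alpha=0.01, beta=0.01):
    """Run a Bernoulli SPRT over a sequence of 0/1 honeypot-hit observations.

    p0 and p1 are ASSUMED per-query honeypot hit probabilities for benign
    users and attackers; they are illustrative, not taken from the paper.
    Returns (decision, number_of_observations_used).
    """
    lower, upper = sprt_thresholds(alpha, beta)
    llr = 0.0
    for n, hit in enumerate(hits, start=1):
        if hit:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "attacker", n
        if llr <= lower:
            return "benign", n
    return "undecided", len(hits)


if __name__ == "__main__":
    print(sprt_decide([1, 1, 1, 1], p0=0.05, p1=0.6))
    print(sprt_decide([0] * 10, p0=0.05, p1=0.6))
```

The test stops as soon as the accumulated log-likelihood ratio crosses either threshold, so the number of observed queries adapts to how informative each observation is; this is the property that sequential tests trade against fixed-sample detectors.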
defense arXiv Oct 3, 2025
Linyu Wu, Linhao Zhong, Wenjie Qu et al. · National University of Singapore · Zhejiang University
Watermarks diffusion LLM text outputs via order-agnostic predictive and bidirectional strategies, achieving 92–99.5% detection at 1% FPR
Output Integrity Attack nlp
Diffusion large language models (dLLMs) offer faster generation than autoregressive models while maintaining comparable quality, but existing watermarking methods fail on them due to their non-sequential decoding. Unlike autoregressive models that generate tokens left-to-right, dLLMs can finalize tokens in arbitrary order, breaking the causal design underlying traditional watermarks. We present DMark, the first watermarking framework designed specifically for dLLMs. DMark introduces three complementary strategies to restore watermark detectability: predictive watermarking uses model-predicted tokens when actual context is unavailable; bidirectional watermarking exploits both forward and backward dependencies unique to diffusion decoding; and predictive-bidirectional watermarking combines both approaches to maximize detection strength. Experiments across multiple dLLMs show that DMark achieves 92.0–99.5% detection rates at 1% false positive rate while maintaining text quality, compared to only 49.6–71.2% for naive adaptations of existing methods. DMark also demonstrates robustness against text manipulations, establishing that effective watermarking is feasible for non-autoregressive language models.
llm transformer National University of Singapore · Zhejiang University
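To make "detection at 1% false positive rate" concrete, here is a minimal sketch of a green-list watermark detector of the kind used for autoregressive models. This is a swapped-in, generic technique and an assumption, not DMark's scheme: the abstract does not specify the hashing or scoring, and the names `is_green`, `z_score`, `embed_greedy`, the key, and the vocabulary size are all hypothetical. The causal dependence on the previous token in `is_green` is exactly what arbitrary-order decoding breaks.

```python
import hashlib
import math

Z_AT_1PCT_FPR = 2.326  # one-sided standard normal quantile for a 1% FPR


def is_green(prev_token, token, gamma=0.5, key=b"demo-key"):
    """Pseudo-randomly assign `token` to the green list given the previous token.

    ASSUMPTION: a keyed hash of (prev_token, token); a fraction `gamma` of
    tokens is green for any context.
    """
    data = key + prev_token.to_bytes(4, "big") + token.to_bytes(4, "big")
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma


def z_score(tokens, gamma=0.5, key=b"demo-key"):
    """z-statistic for the count of green tokens against the null (no watermark)."""
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return 0.0
    n = len(pairs)
    green = sum(is_green(p, t, gamma, key) for p, t in pairs)
    return (green - gamma * n) / math.sqrt(n * gamma * (1 - gamma))


def embed_greedy(length, vocab=1000, gamma=0.5, key=b"demo-key"):
    """Toy embedder: always emit the smallest green token id for the context."""
    tokens = [0]
    while len(tokens) < length:
        prev = tokens[-1]
        nxt = next((t for t in range(vocab) if is_green(prev, t, gamma, key)), 0)
        tokens.append(nxt)
    return tokens


if __name__ == "__main__":
    marked = embed_greedy(51)
    print(z_score(marked) > Z_AT_1PCT_FPR)
```

A text is flagged as watermarked when its z-score exceeds the threshold. In a diffusion model the previous token may not yet be finalized when a position is decoded, which is the gap the predictive and bidirectional strategies described above are meant to close.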