defense 2025

AttnTrace: Attention-based Context Traceback for Long-Context LLMs

Yanting Wang , Runpeng Geng , Ying Chen , Jinyuan Jia

0 citations

α

Published on arXiv

2508.03793

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

AttnTrace outperforms SOTA context traceback methods (e.g., TracLLM) in both accuracy and efficiency, reducing traceback time from hundreds of seconds to near real-time, while improving prompt injection detection under long-context settings

AttnTrace

Novel technique introduced


Long-context large language models (LLMs), such as Gemini-2.5-Pro and Claude-Sonnet-4, are increasingly used to empower advanced AI systems, including retrieval-augmented generation (RAG) pipelines and autonomous agents. In these systems, an LLM receives an instruction along with a context--often consisting of texts retrieved from a knowledge database or memory--and generates a response that is contextually grounded by following the instruction. Recent studies have designed solutions to trace back to a subset of texts in the context that contributes most to the response generated by the LLM. These solutions have numerous real-world applications, including performing post-attack forensic analysis and improving the interpretability and trustworthiness of LLM outputs. While significant efforts have been made, state-of-the-art solutions such as TracLLM often lead to a high computation cost, e.g., it takes TracLLM hundreds of seconds to perform traceback for a single response-context pair. In this work, we propose AttnTrace, a new context traceback method based on the attention weights produced by an LLM for a prompt. To effectively utilize attention weights, we introduce two techniques designed to enhance the effectiveness of AttnTrace, and we provide theoretical insights for our design choice. We also perform a systematic evaluation for AttnTrace. The results demonstrate that AttnTrace is more accurate and efficient than existing state-of-the-art context traceback methods. We also show that AttnTrace can improve state-of-the-art methods in detecting prompt injection under long contexts through the attribution-before-detection paradigm. As a real-world application, we demonstrate that AttnTrace can effectively pinpoint injected instructions in a paper designed to manipulate LLM-generated reviews. The code is at https://github.com/Wang-Yanting/AttnTrace.


Key Contributions

  • AttnTrace: an attention-weight-based context traceback method that identifies which retrieved context segments most influenced an LLM response, more accurately and efficiently than SOTA methods like TracLLM
  • Two techniques for effective attention weight utilization with theoretical justification, enabling tractable traceback for long contexts
  • Attribution-before-detection paradigm that improves SOTA prompt injection detection in long-context LLMs, with demonstrated forensic application to real-world adversarial papers targeting LLM reviewers

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llmtransformer
Threat Tags
inference_time
Applications
retrieval-augmented generationautonomous agentsllm-generated reviewsprompt injection detection