defense 2025

AttnTrace: Attention-based Context Traceback for Long-Context LLMs

Yanting Wang , Runpeng Geng , Ying Chen , Jinyuan Jia

Pennsylvania State University

0 citations

Published on arXiv

2508.03793

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

AttnTrace outperforms SOTA context traceback methods (e.g., TracLLM) in both accuracy and efficiency, reducing traceback time from hundreds of seconds to near real-time, while improving prompt injection detection under long-context settings

AttnTrace

Novel technique introduced

Long-context large language models (LLMs), such as Gemini-2.5-Pro and Claude-Sonnet-4, are increasingly used to empower advanced AI systems, including retrieval-augmented generation (RAG) pipelines and autonomous agents. In these systems, an LLM receives an instruction along with a context--often consisting of texts retrieved from a knowledge database or memory--and generates a response that is contextually grounded by following the instruction. Recent studies have designed solutions to trace back to a subset of texts in the context that contributes most to the response generated by the LLM. These solutions have numerous real-world applications, including performing post-attack forensic analysis and improving the interpretability and trustworthiness of LLM outputs. While significant efforts have been made, state-of-the-art solutions such as TracLLM often lead to a high computation cost, e.g., it takes TracLLM hundreds of seconds to perform traceback for a single response-context pair. In this work, we propose AttnTrace, a new context traceback method based on the attention weights produced by an LLM for a prompt. To effectively utilize attention weights, we introduce two techniques designed to enhance the effectiveness of AttnTrace, and we provide theoretical insights for our design choice. We also perform a systematic evaluation for AttnTrace. The results demonstrate that AttnTrace is more accurate and efficient than existing state-of-the-art context traceback methods. We also show that AttnTrace can improve state-of-the-art methods in detecting prompt injection under long contexts through the attribution-before-detection paradigm. As a real-world application, we demonstrate that AttnTrace can effectively pinpoint injected instructions in a paper designed to manipulate LLM-generated reviews. The code is at https://github.com/Wang-Yanting/AttnTrace.

Key Contributions

AttnTrace: an attention-weight-based context traceback method that identifies which retrieved context segments most influenced an LLM response, more accurately and efficiently than SOTA methods like TracLLM
Two techniques for effective attention weight utilization with theoretical justification, enabling tractable traceback for long contexts
Attribution-before-detection paradigm that improves SOTA prompt injection detection in long-context LLMs, with demonstrated forensic application to real-world adversarial papers targeting LLM reviewers

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llmtransformer

Threat Tags

inference_time

Applications

retrieval-augmented generationautonomous agentsllm-generated reviewsprompt injection detection

Read PDF arXiv Code

AttnTrace: Attention-based Context Traceback for Long-Context LLMs

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models

Defensive M2S: Training Guardrail Models on Compressed Multi-turn Conversations

CNT: Safety-oriented Function Reuse across LLMs via Cross-Model Neuron Transfer

MENTOR: A Metacognition-Driven Self-Evolution Framework for Uncovering and Mitigating Implicit Domain Risks in LLMs

Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations

STAR-S: Improving Safety Alignment through Self-Taught Reasoning on Safety Rules

Trust The Typical

Feature-Guided SAE Steering for Refusal-Rate Control using Contrasting Prompts