Transferable Direct Prompt Injection via Activation-Guided MCMC Sampling

Direct Prompt Injection (DPI) attacks pose a critical security threat to Large Language Models (LLMs) because they are easy to execute and can cause substantial damage. Existing white-box and gray-box methods are impractical, and black-box methods transfer poorly; to address both limitations, we propose an activation-guided prompt injection attack framework. We first construct an Energy-based Model (EBM) from the activations of a surrogate model to evaluate the quality of adversarial prompts. Guided by the trained EBM, we employ token-level Markov Chain Monte Carlo (MCMC) sampling to adaptively optimize adversarial prompts, enabling gradient-free black-box attacks. Experimental results demonstrate superior cross-model transferability: the method achieves a 49.6% attack success rate (ASR) across five mainstream LLMs, a 34.6% improvement over human-crafted prompts, and maintains a 36.6% ASR on unseen task scenarios. Interpretability analysis reveals a correlation between activations and attack effectiveness, highlighting the critical role of semantic patterns in transferable vulnerability exploitation.

Key Contributions

First transferable DPI attack guided by surrogate model activations, eliminating the need to query the victim model directly
Energy-based Model (EBM) constructed from surrogate activations to score and guide adversarial prompt quality
Token-level MCMC sampling strategy for gradient-free optimization of natural adversarial prompts with high cross-model transferability
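
The general idea of energy-guided, gradient-free token-level sampling named in the contributions above can be illustrated with a generic Metropolis-Hastings loop. This is only a minimal sketch under stated assumptions, not the paper's method: the proposal distribution (uniform single-token replacement), the temperature, and the names `mh_token_sampler` and `toy_energy` are hypothetical, and `toy_energy` is a harmless stand-in for a trained EBM over surrogate activations, which this summary does not specify.

```python
import math
import random


def mh_token_sampler(tokens, vocab, energy, steps=500, temperature=1.0, seed=0):
    """Generic token-level Metropolis-Hastings search (illustrative only).

    `energy` is any callable mapping a token list to a float; lower is better.
    Returns the lowest-energy sequence visited and its energy.
    """
    rng = random.Random(seed)
    current = list(tokens)
    current_e = energy(current)
    best, best_e = list(current), current_e

    for _ in range(steps):
        # Symmetric proposal (assumption): resample one uniformly chosen
        # position with a uniformly chosen vocabulary token.
        pos = rng.randrange(len(current))
        proposal = list(current)
        proposal[pos] = rng.choice(vocab)
        proposal_e = energy(proposal)

        # Metropolis acceptance: always accept a lower energy, otherwise
        # accept with probability exp(-(E' - E) / T).
        delta = proposal_e - current_e
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_e = proposal, proposal_e
            if current_e < best_e:
                best, best_e = list(current), current_e

    return best, best_e


# Hypothetical stand-in for a learned energy function: counts positions
# that differ from a fixed reference sequence (0.0 means a perfect match).
REFERENCE = ["a", "b", "c", "d"]


def toy_energy(tokens):
    return float(sum(t != r for t, r in zip(tokens, REFERENCE)))


if __name__ == "__main__":
    start = ["d", "d", "d", "d"]
    best, best_e = mh_token_sampler(start, ["a", "b", "c", "d"], toy_energy)
    print(best, best_e)
```

Because the sampler only calls `energy` and never differentiates it, any scoring function can be plugged in; that property is what makes this family of samplers gradient-free.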

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

black_box, inference_time, targeted

Datasets

InjecAgent, TensorTrust

Applications

llm chatbots, llm-integrated applications, cloud-based llm apis

2025