LoopLLM: Transferable Energy-Latency Attacks in LLMs via Repetitive Generation
Xingyu Li 1,2, Xiaolei Liu 1,2, Cheng Liu 1,2, Yixiao Xu 3, Kangyi Ding 1,2, Bangzhou Xin 1,2, Jia-Li Yin 4
1 National Interdisciplinary Research Center of Engineering Physics
2 Institute of Computer Application, China Academy of Engineering Physics
Published on arXiv: 2511.07876
Model Denial of Service
OWASP LLM Top 10 — LLM04
Key Finding
LoopLLM achieves over 90% of the maximum output length across 12 open-source LLMs (vs. ~20% for baselines) and improves black-box transferability by ~40% to commercial models DeepSeek-V3 and Gemini 2.5 Flash.
LoopLLM
Novel technique introduced
As large language models (LLMs) scale, inference consumes substantial computational resources, exposing them to energy-latency attacks, in which crafted prompts induce high energy and latency costs. Existing attack methods aim to prolong output by delaying the generation of termination symbols. However, as outputs grow longer, controlling termination symbols through the input becomes difficult, making these methods less effective. We therefore propose LoopLLM, an energy-latency attack framework built on the observation that repetitive generation can trigger low-entropy decoding loops, reliably compelling LLMs to generate until their output limits. LoopLLM introduces (1) a repetition-inducing prompt optimization that exploits autoregressive vulnerabilities to induce repetitive generation, and (2) a token-aligned ensemble optimization that aggregates gradients across surrogate models to improve cross-model transferability. Extensive experiments on 12 open-source and 2 commercial LLMs show that LoopLLM significantly outperforms existing methods, achieving over 90% of the maximum output length, compared with about 20% for baselines, and improving transferability by around 40% to DeepSeek-V3 and Gemini 2.5 Flash.
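The core mechanism can be illustrated with a toy example (not the paper's code): once greedy decoding enters a near-deterministic repetitive cycle, the termination token is never the argmax, so generation runs until the hard output limit. The `next_token_dist` function below is a hypothetical stand-in for an LLM's next-token distribution, with probabilities chosen for illustration.

```python
# Toy sketch: a low-entropy decoding loop prevents EOS from ever being
# selected under greedy decoding, forcing output to the length cap.

EOS = "<eos>"
MAX_NEW_TOKENS = 64

def next_token_dist(context):
    """Hypothetical next-token distribution. Once the model has emitted a
    repeated bigram, probability mass collapses onto continuing the cycle
    (a low-entropy decoding loop)."""
    if context[-2:] == ["loop", "loop"]:
        return {"loop": 0.99, EOS: 0.01}   # near-deterministic repetition
    return {EOS: 0.5, "ok": 0.3, "loop": 0.2}

def greedy_decode(prompt_tokens):
    out = list(prompt_tokens)
    for _ in range(MAX_NEW_TOKENS):
        dist = next_token_dist(out)
        tok = max(dist, key=dist.get)      # greedy: pick the argmax token
        if tok == EOS:
            break                          # normal termination
        out.append(tok)
    return out[len(prompt_tokens):]

# A benign prompt terminates immediately; an adversarial prompt that
# seeds the repetitive cycle generates until the output limit.
print(len(greedy_decode(["hello"])))                    # → 0
print(len(greedy_decode(["repeat", "loop", "loop"])))   # → 64 (MAX_NEW_TOKENS)
```

This is why the attack sidesteps the difficulty of suppressing termination symbols directly: inside the loop, EOS never competes for the argmax in the first place.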
Key Contributions
- Repetition-inducing prompt optimization that exploits autoregressive vulnerabilities to lock LLMs into low-entropy decoding loops, reliably reaching maximum output length
- Token-aligned ensemble optimization that aggregates gradients across multiple surrogate models sharing a tokenizer to improve cross-model transferability
- Empirical evaluation on 12 open-source and 2 commercial LLMs showing >90% max output length achieved (vs. ~20% for baselines) and ~40% transferability gain to DeepSeek-V3 and Gemini 2.5 Flash
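The token-aligned ensemble idea can be sketched as follows. All names, shapes, and the GCG-style coordinate update are illustrative assumptions, not the paper's implementation; the point is only that surrogate models sharing a tokenizer produce gradients in the same (position, vocabulary) space, so they can be summed directly before ranking candidate token substitutions.

```python
# Hedged sketch of token-aligned gradient aggregation across surrogates.
import numpy as np

VOCAB_SIZE, PROMPT_LEN = 100, 8
rng = np.random.default_rng(0)

def surrogate_gradient(model_seed, prompt_ids):
    """Stand-in for d(loss)/d(one-hot prompt) from one surrogate model.
    A real attack would backpropagate an adversarial objective through
    each surrogate; here a seeded random matrix plays that role."""
    g_rng = np.random.default_rng(model_seed)
    return g_rng.normal(size=(PROMPT_LEN, VOCAB_SIZE))

def ensemble_step(prompt_ids, model_seeds):
    # Token alignment: a shared tokenizer means every surrogate's gradient
    # lives in the same (position, vocab) space, so a plain sum aggregates
    # them without any remapping.
    agg = sum(surrogate_gradient(s, prompt_ids) for s in model_seeds)
    # Greedily take the (position, token) substitution with the steepest
    # aggregated descent direction (a GCG-style coordinate step).
    pos, tok = np.unravel_index(np.argmin(agg), agg.shape)
    new_prompt = prompt_ids.copy()
    new_prompt[pos] = tok
    return new_prompt

prompt = rng.integers(0, VOCAB_SIZE, size=PROMPT_LEN)
updated = ensemble_step(prompt, model_seeds=[1, 2, 3])
print(int((updated != prompt).sum()))  # at most one position is substituted
```

Aggregating before the substitution step is what drives transferability: a token update must lower the loss of every surrogate simultaneously, rather than overfitting to one.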