attack 2025

Are My Optimized Prompts Compromised? Exploring Vulnerabilities of LLM-based Optimizers

Andrew Zhao 1, Reshmi Ghosh 2, Vitor Carvalho 2, Emily Lawton 2, Keegan Hines 2, Gao Huang 1, Jack W. Stokes 2

1 citations · 46 references · arXiv

α

Published on arXiv

2510.14381

Data Poisoning Attack

OWASP ML Top 10 — ML02

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Feedback-based poisoning raises attack success rate by up to ΔASR=0.48 across two LLM optimizers, vastly exceeding query-only poisoning, while a highlighting defense reduces the fake reward attack's ΔASR from 0.23 to 0.07.

Fake Reward Attack

Novel technique introduced


Large language model (LLM) systems increasingly power everyday AI applications such as chatbots, computer-use assistants, and autonomous robots, where performance often depends on manually well-crafted prompts. LLM-based prompt optimizers reduce that effort by iteratively refining prompts from scored feedback, yet the security of this optimization stage remains underexamined. We present the first systematic analysis of poisoning risks in LLM-based prompt optimization. Using HarmBench, we find systems are substantially more vulnerable to manipulated feedback than to query poisoning alone: feedback-based attacks raise attack success rate (ASR) by up to ΔASR = 0.48. We introduce a simple fake reward attack that requires no access to the reward model and significantly increases vulnerability. We also propose a lightweight highlighting defense that reduces the fake reward ΔASR from 0.23 to 0.07 without degrading utility. These results establish prompt optimization pipelines as a first-class attack surface and motivate stronger safeguards for feedback channels and optimization frameworks.


Key Contributions

  • First systematic analysis of poisoning risks in LLM-based prompt optimization pipelines, distinguishing query poisoning from feedback poisoning and showing the latter is far more dangerous
  • Fake reward attack that appends fabricated positive feedback tokens to inputs with no access to the reward model, raising ASR by up to ΔASR=0.48
  • Lightweight highlighting defense that makes the optimizer aware of potentially poisoned feedback, reducing fake reward ΔASR from 0.23 to 0.07 without utility degradation

🛡️ Threat Analysis

Data Poisoning Attack

The paper's core contribution is demonstrating that adversarially manipulated feedback in LLM prompt optimization loops constitutes a data/feedback poisoning attack — the corrupted optimization signal steers the system prompt toward unsafe behavior without touching model weights.


Details

Domains
nlp
Model Types
llmtransformer
Threat Tags
training_timeblack_boxtargeted
Datasets
HarmBench
Applications
chatbotsautonomous agentsllm prompt optimization systems