Black-box Optimization of LLM Outputs by Asking for Directions
Jie Zhang 1, Meng Ding 2, Yang Liu 3, Jue Hong 3, Robert Mullins 1
Published on arXiv — arXiv:2510.16794
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Achieves 45–51% adversarial-example success rates and improves jailbreak and prompt-injection success by up to 33% over transfer baselines, using only textual API access to Claude and GPT models.
Asking for Directions (comparative-confidence hill-climbing)
Novel technique introduced
We present a novel approach for attacking black-box large language models (LLMs) by exploiting their ability to express confidence in natural language. Existing black-box attacks either require access to continuous model outputs such as logits or confidence scores (which are rarely available in practice) or rely on proxy signals from other models. Instead, we demonstrate how to prompt LLMs to express their internal confidence in a way that is sufficiently calibrated to enable effective adversarial optimization. We apply our general method to three attack scenarios: adversarial examples for vision-LLMs, jailbreaks, and prompt injections. Our attacks successfully generate malicious inputs against systems that expose only textual outputs, thereby dramatically expanding the attack surface of deployed LLMs. We further find that better and larger models exhibit superior calibration when expressing confidence, creating a concerning security paradox in which improvements in model capability directly enhance vulnerability. Our code is available at [this link](https://github.com/zj-jayzhang/black_box_llm_optimization).
Key Contributions
- Novel optimization signal for text-only black-box attacks: prompting the victim LLM to compare two candidate inputs and report which is closer to the attack objective, enabling hill-climbing without logits or proxy models.
- Demonstration across three attack scenarios (adversarial examples on vision-LLMs, jailbreaks, prompt injections) on Claude and GPT models, requiring only 5–450 queries and improving over transfer-based baselines by up to 33%.
- Counterintuitive finding that larger, more capable models are better calibrated at comparative confidence expression, making them paradoxically more vulnerable to this class of attacks.
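The comparison signal in the first contribution can be sketched as a simple hill-climbing loop. This is a hypothetical illustration, not the paper's implementation: `query_victim` is a stub standing in for a text-only chat API call, and the prompt wording and mutation step are assumptions.

```python
import random

def query_victim(prompt: str) -> str:
    """Stand-in for a text-only API call to the victim model.
    Toy oracle: prefer the candidate containing more of a
    (hypothetical) target character 'x'."""
    a = prompt.split("A: ")[1].split("\n")[0]
    b = prompt.split("B: ")[1].split("\n")[0]
    return "A" if a.count("x") >= b.count("x") else "B"

def mutate(candidate: str, rng: random.Random) -> str:
    """Toy proposal distribution: replace one random character."""
    i = rng.randrange(len(candidate))
    return candidate[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + candidate[i + 1:]

def hill_climb(seed: str, objective: str, steps: int = 50, rng_seed: int = 0) -> str:
    """Keep whichever of (current, proposal) the victim reports as
    closer to the attack objective; no logits or proxy model needed."""
    rng = random.Random(rng_seed)
    current = seed
    for _ in range(steps):
        proposal = mutate(current, rng)
        prompt = (
            f"Goal: {objective}\n"
            f"A: {current}\n"
            f"B: {proposal}\n"
            "Which candidate is closer to the goal? Answer A or B."
        )
        if query_victim(prompt) == "B":
            current = proposal  # accept only reported improvements
    return current
```

Because only strict improvements are accepted, progress toward the objective is monotone in whatever the victim's comparative judgment measures; each optimization step costs one query, consistent with the low query budgets (5–450) reported above.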
🛡️ Threat Analysis
One of the three attack scenarios is crafting imperceptible adversarial image perturbations to cause vision-LLM misclassification — a classic inference-time input manipulation attack, here optimized via gradient-free hill-climbing using the victim model's self-reported comparative confidence.
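The image-perturbation scenario can be sketched the same way under an L∞ budget. Everything here is an assumption for illustration: the budget `EPSILON`, the random signed proposal step, and `target_score`, a stub for what would really be a text query asking the vision-LLM which of two images it is more confident matches the target.

```python
import numpy as np

EPSILON = 8 / 255  # hypothetical L-infinity perturbation budget

def clip_to_budget(image: np.ndarray, adv: np.ndarray) -> np.ndarray:
    """Project a candidate back into the epsilon-ball around the
    original image and into the valid pixel range [0, 1]."""
    adv = np.clip(adv, image - EPSILON, image + EPSILON)
    return np.clip(adv, 0.0, 1.0)

def perturb_image(image, target_score, steps=100, step_size=2 / 255, seed=0):
    """Gradient-free hill-climbing: propose a random signed step and
    keep it only if the oracle prefers the new image. `target_score`
    stands in for the victim's self-reported comparative confidence."""
    rng = np.random.default_rng(seed)
    adv = image.copy()
    for _ in range(steps):
        direction = rng.choice([-1.0, 1.0], size=image.shape)
        candidate = clip_to_budget(image, adv + step_size * direction)
        if target_score(candidate) > target_score(adv):
            adv = candidate
    return adv
```

The projection step keeps the perturbation imperceptible by construction, so the loop optimizes only over inputs that satisfy the attack constraint.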