attack 2026

Adaptive Prompt Embedding Optimization for LLM Jailbreaking

Miles Q. Li 1, Benjamin C. M. Fung 1, Boyang Li 2, Radin Hamidi Rad 3, Ebrahim Bagheri 4

Published on arXiv: 2604.24983

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

PEO outperforms nanoGCG, SPT, and BEAST on ASR-Judge across four models on both the AdvBench and HarmBench benchmarks

Prompt Embedding Optimization (PEO)

Novel technique introduced


Existing white-box jailbreak attacks against aligned LLMs typically append discrete adversarial suffixes to the user prompt, which visibly alters the prompt and operates in a combinatorial token space. Prior work has avoided directly optimizing the embeddings of the original prompt tokens, presumably because perturbing them risks destroying the prompt's semantic content. We propose Prompt Embedding Optimization (PEO), a multi-round white-box jailbreak that directly optimizes the embeddings of the original prompt tokens without appending any adversarial tokens, and show that the concern is unfounded: the optimized embeddings remain close enough to their originals that the visible prompt string is preserved exactly after nearest-token projection, and quantitative analysis shows the model's responses stay on topic for the large majority of prompts. PEO combines continuous embedding-space optimization with structured continuation targets and an adaptive failure-focused schedule. Counterintuitively, later PEO rounds can benefit from heuristic composite response scaffolds that are not natural standalone templates, yet ASR-Judge shows that the resulting gains are not merely empty formatting or scaffold-only outputs. Across two standard harmful-behavior benchmarks and competing white-box attacks spanning discrete suffix search, appended adversarial embeddings, and search-based adversarial generation, PEO outperforms all of them in our experiments.


Key Contributions

  • Prompt Embedding Optimization (PEO), which directly optimizes the embeddings of the existing prompt tokens without appending adversarial content
  • An adaptive multi-round schedule with structured continuation scaffolds that concentrates the optimization budget on unsolved prompts
  • A demonstration that the embedding perturbations preserve the visible prompt text exactly (0% text change after nearest-token projection) while still eliciting harmful outputs
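
The "0% text change after projection" claim can be made concrete with a small measurement sketch. The Python below is an illustration only, not the authors' code: it assumes nearest-token projection means picking the vocabulary embedding with the smallest Euclidean distance (the summary does not state the metric), and the names `nearest_token_ids` and `text_change_rate` as well as the toy identity-matrix vocabulary are hypothetical. It covers only the projection check, not the optimization itself.

```python
import numpy as np


def nearest_token_ids(embeddings, vocab_matrix):
    """Map each embedding row to the index of its closest vocabulary row.

    Assumption: closeness is Euclidean (L2) distance; the source does not
    specify the metric used for projection.
    """
    # Pairwise differences, shape (seq_len, vocab_size, dim).
    diffs = embeddings[:, None, :] - vocab_matrix[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    return np.argmin(dists, axis=1)


def text_change_rate(original_ids, perturbed_embeddings, vocab_matrix):
    """Fraction of positions whose projected token differs from the original."""
    projected = nearest_token_ids(perturbed_embeddings, vocab_matrix)
    return float(np.mean(projected != np.asarray(original_ids)))


# Toy vocabulary: 6 tokens with one-hot embeddings (hypothetical data).
vocab = np.eye(6)
original_ids = np.array([0, 2, 5, 3])

# Small perturbation: every embedding stays closest to its original token.
small = vocab[original_ids] + 0.01

# Large perturbation: position 1 is moved most of the way toward token 4.
large = small.copy()
large[1] = 0.2 * vocab[2] + 0.8 * vocab[4]

print(text_change_rate(original_ids, small, vocab))  # 0.0
print(text_change_rate(original_ids, large, vocab))  # 0.25
```

A rate of 0.0 corresponds to the case the summary describes, in which the visible prompt string is unchanged after projection. Any nonzero rate means at least one token would render differently.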

🛡️ Threat Analysis

Input Manipulation Attack

Gradient-based adversarial attack optimizing continuous embeddings of prompt tokens to cause misaligned LLM outputs at inference time.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time, targeted
Datasets
AdvBench, HarmBench
Applications
llm safety, chatbots, aligned language models