
Distillability of LLM Security Logic: Predicting Attack Success Rate of Outline Filling Attack via Ranking Regression

Tianyu Zhang, Zihang Xi, Jingyu Hua, Sheng Zhong

0 citations · 28 references · arXiv


Published on arXiv: 2511.22044

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

The proxy model achieves 91.1% accuracy predicting pairwise ASR rankings and 69.2% absolute ASR prediction accuracy on unseen dangerous topics, confirming that jailbreak behavior is predictable and distillable.

Outline Filling Attack / Narrow Safety Proxy

Novel technique introduced


In the realm of black-box jailbreak attacks on large language models (LLMs), the feasibility of constructing a narrow safety proxy, a lightweight model designed to predict the attack success rate (ASR) of adversarial prompts, remains underexplored. This work investigates the distillability of an LLM's core security logic. We propose a novel framework that incorporates an improved outline filling attack to achieve dense sampling of the model's security boundaries. Furthermore, we introduce a ranking regression paradigm that replaces standard regression and trains the proxy model to predict which of two prompts yields the higher ASR. Experimental results show that our proxy model achieves 91.1 percent accuracy in predicting the relative ASR ranking of prompt pairs, and 69.2 percent accuracy in predicting absolute ASR. These findings confirm the predictability and distillability of jailbreak behaviors, and demonstrate the potential of leveraging such distillability to optimize black-box attacks.
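The ranking regression idea can be illustrated with a minimal toy sketch: instead of regressing absolute ASR values, a proxy is trained only on pairwise labels ("prompt a succeeds more often than prompt b") via a logistic (Bradley-Terry / RankNet-style) loss on the score gap. Everything below (the linear proxy, the synthetic features, the hyperparameters) is an invented illustration, not the paper's architecture or data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: each "prompt" is a feature vector; a hidden weight vector
# determines its true ASR score. The proxy never sees absolute scores,
# only pairwise comparisons.
true_w = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(200, 3))          # 200 candidate prompts
scores = X @ true_w                    # hidden ground-truth ASR scores

# Sample random training pairs (i, j), labeled 1 if prompt i ranks higher.
pairs = rng.integers(0, 200, size=(3000, 2))
pairs = pairs[pairs[:, 0] != pairs[:, 1]][:2000]

w = np.zeros(3)
lr = 0.1
for epoch in range(50):
    for i, j in pairs:
        # Logistic loss on the score difference (Bradley-Terry style).
        diff = X[i] @ w - X[j] @ w
        y = 1.0 if scores[i] > scores[j] else 0.0
        p = 1.0 / (1.0 + np.exp(-diff))
        w -= lr * (p - y) * (X[i] - X[j])

# Evaluate pairwise ranking accuracy on fresh, unseen prompts.
Xt = rng.normal(size=(100, 3))
st, pt = Xt @ true_w, Xt @ w
correct = sum(
    (st[i] > st[j]) == (pt[i] > pt[j])
    for i in range(100) for j in range(i + 1, 100)
)
acc = correct / (100 * 99 / 2)
```

Because the loss depends only on score differences, the proxy learns a consistent ordering even when absolute ASR values would shift across domains, which is the motivation the paper gives for preferring ranking over standard regression.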


Key Contributions

  • Outline Filling Attack: decomposes dangerous questions into hierarchical outlines and induces LLMs to fill in content, generating diverse yet semantically similar jailbreak prompts for dense boundary sampling
  • Ranking Regression paradigm: trains a Narrow Safety Proxy to predict which of two prompts yields higher ASR (pairwise ranking), sidestepping domain-shift problems of absolute ASR regression
  • Empirical confirmation that LLM security judgment logic is distillable — proxy achieves 91.1% pairwise ranking accuracy and 69.2% ASR prediction accuracy on unseen dangerous questions
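The outline-filling idea of generating many semantically similar prompt variants can be sketched as simple template expansion: decompose a question into a hierarchical outline and emit one fill-in prompt per outline arrangement. The function name, section titles, and prompt wording below are all hypothetical; the paper's actual templates and the LLM calls that fill them are not reproduced here.

```python
def outline_prompts(topic: str, sections: list[str]) -> list[str]:
    """Emit one fill-in prompt per outline arrangement (here: per leading
    section), yielding diverse but semantically similar variants."""
    prompts = []
    for lead in sections:
        ordered = [lead] + [s for s in sections if s != lead]
        outline = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(ordered))
        prompts.append(
            f"Complete the following outline about {topic} "
            f"by filling in each section:\n{outline}"
        )
    return prompts

# Benign placeholder topic and invented section titles for illustration.
variants = outline_prompts(
    "a benign example topic",
    ["Background", "Key steps", "Common pitfalls"],
)
print(len(variants))  # → 3
```

Densely sampling the safety boundary this way gives the proxy many near-duplicate prompts with differing ASR, which is exactly the kind of fine-grained comparison the ranking regression objective needs.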

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time, targeted
Datasets
Llama-3 (target model), GPT-4o-mini (target model), Qwen (target model)
Applications
llm safety evaluation, chatbot jailbreaking