
One Leak Away: How Pretrained Model Exposure Amplifies Jailbreak Risks in Finetuned LLMs

Yixin Tan 1, Zhe Yu 2, Jun Sakuma 1,2


Published on arXiv: 2512.14751

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

PGP achieves superior adversarial prompt transfer from pretrained to finetuned LLMs across multiple model families under a realistic white-box-pretrained / black-box-finetuned threat model, exposing a systemic security risk in disclosing pretrained base models.

Probe-Guided Projection (PGP)

Novel technique introduced


Finetuning pretrained large language models (LLMs) has become the standard paradigm for developing downstream applications. However, its security implications remain unclear, particularly regarding whether finetuned LLMs inherit jailbreak vulnerabilities from their pretrained sources. We investigate this question in a realistic pretrain-to-finetune threat model, where the attacker has white-box access to the pretrained LLM and only black-box access to its finetuned derivatives. Empirical analysis shows that adversarial prompts optimized on the pretrained model transfer most effectively to its finetuned variants, revealing inherited vulnerabilities from pretrained to finetuned LLMs. To further examine this inheritance, we conduct representation-level probing, which shows that transferable prompts are linearly separable within the pretrained hidden states, suggesting that universal transferability is encoded in pretrained representations. Building on this insight, we propose the Probe-Guided Projection (PGP) attack, which steers optimization toward transferability-relevant directions. Experiments across multiple LLM families and diverse finetuned tasks confirm PGP's strong transfer success, underscoring the security risks inherent in the pretrain-to-finetune paradigm.


Key Contributions

  • Empirically demonstrates that adversarial prompts optimized on a pretrained LLM transfer more effectively to its finetuned derivatives than to unrelated models, revealing inherited jailbreak vulnerabilities in the pretrain-to-finetune paradigm.
  • Shows via representation-level probing that transferable adversarial prompts are linearly separable in pretrained hidden states, suggesting universal transferability is encoded in pretrained representations.
  • Proposes Probe-Guided Projection (PGP), which steers gradient-based adversarial optimization toward transferability-relevant directions, achieving superior jailbreak transfer across multiple LLM families and diverse finetuned tasks.
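The probing contribution above can be illustrated with a minimal sketch: train a linear probe (here, logistic regression fit with plain gradient descent) to separate hidden states of "transferable" vs. "non-transferable" prompts. The hidden states below are synthetic stand-ins with a planted separating direction; the dimensions, data, and separation margin are all assumptions, since the paper probes real pretrained-LLM activations.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64   # hidden-state dimensionality (assumed)
n = 200  # prompts per class (assumed)

# Synthetic hidden states: the two classes differ along a planted direction,
# standing in for a "transferability" direction in pretrained representations.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
transferable = rng.normal(size=(n, d)) + 2.0 * direction
non_transferable = rng.normal(size=(n, d)) - 2.0 * direction

X = np.vstack([transferable, non_transferable])
y = np.concatenate([np.ones(n), np.zeros(n)])

# Fit a logistic-regression probe with gradient descent.
w, b, lr = np.zeros(d), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * np.mean(p - y)

acc = np.mean(((X @ w + b) > 0) == (y == 1))
print(f"probe accuracy: {acc:.2f}")  # high on this linearly separable data
```

High probe accuracy on real activations is what the paper takes as evidence that transferability is linearly encoded in pretrained hidden states.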

🛡️ Threat Analysis

Input Manipulation Attack

PGP uses gradient-based adversarial suffix optimization (token-level perturbations computed on the pretrained model), steering the search toward representation-space directions that survive finetuning. This is squarely adversarial input manipulation at inference time, analogous to GCG-style attacks.
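The steering idea can be sketched in a toy continuous form: keep only the component of each optimization step that lies along a probe-derived direction. Everything here is a stand-in (random probe direction, a quadratic "attack loss", a continuous relaxation of the suffix); the actual PGP attack operates on discrete token suffixes of a pretrained LLM.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 32

# Direction a trained probe associates with transferability (assumed here).
probe_dir = rng.normal(size=d)
probe_dir /= np.linalg.norm(probe_dir)

# Stand-in optimum of a toy attack loss 0.5 * ||x - target||^2.
target = 4.0 * probe_dir + 0.1 * rng.normal(size=d)

x = np.zeros(d)  # continuous relaxation of the adversarial suffix
for _ in range(100):
    grad = x - target                         # gradient of the toy loss
    guided = np.dot(grad, probe_dir) * probe_dir  # project onto probe direction
    x -= 0.1 * guided                         # take only the guided component

# x moves toward the target only along the transfer-relevant direction;
# components orthogonal to probe_dir are never updated.
print(f"alignment: {np.dot(x, probe_dir):.3f}")
```

The design intuition, per the summary above, is that unconstrained gradient steps overfit to the white-box pretrained model, while projection restricts the attack to directions that the probe indicates survive finetuning.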


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
white_box · black_box · inference_time · targeted
Datasets
AdvBench
Applications
instruction-tuned chatbots · code generation assistants · llm apis