One Leak Away: How Pretrained Model Exposure Amplifies Jailbreak Risks in Finetuned LLMs
Yixin Tan 1, Zhe Yu 2, Jun Sakuma 1,2
Published on arXiv
2512.14751
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Adversarial prompts optimized with PGP on a white-box pretrained model transfer to its black-box finetuned derivatives at higher rates than baseline attacks, across multiple model families. This exposes a systemic security risk in disclosing pretrained base models.
Probe-Guided Projection (PGP)
Novel technique introduced
Finetuning pretrained large language models (LLMs) has become the standard paradigm for developing downstream applications. However, its security implications remain unclear, particularly regarding whether finetuned LLMs inherit jailbreak vulnerabilities from their pretrained sources. We investigate this question in a realistic pretrain-to-finetune threat model, where the attacker has white-box access to the pretrained LLM and only black-box access to its finetuned derivatives. Empirical analysis shows that adversarial prompts optimized on the pretrained model transfer most effectively to its finetuned variants, revealing inherited vulnerabilities from pretrained to finetuned LLMs. To further examine this inheritance, we conduct representation-level probing, which shows that transferable prompts are linearly separable within the pretrained hidden states, suggesting that universal transferability is encoded in pretrained representations. Building on this insight, we propose the Probe-Guided Projection (PGP) attack, which steers optimization toward transferability-relevant directions. Experiments across multiple LLM families and diverse finetuned tasks confirm PGP's strong transfer success, underscoring the security risks inherent in the pretrain-to-finetune paradigm.
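The representation-level probing described above amounts to fitting a linear classifier on the pretrained model's hidden states to separate transferable from non-transferable adversarial prompts. The following is a minimal numpy sketch of that idea using synthetic hidden states (real ones would come from a forward pass through the pretrained model); the dimensions, data, and training loop are illustrative assumptions, not the paper's setup.

```python
import numpy as np

# Hypothetical hidden states of a pretrained model for prompts labeled
# transferable (1) vs. non-transferable (0). We synthesize them as two
# Gaussians separated along one "transferability" direction.
rng = np.random.default_rng(0)
d = 64                                   # hidden dimension (illustrative)
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

n = 200
h_pos = rng.normal(size=(n, d)) + 2.0 * direction   # transferable prompts
h_neg = rng.normal(size=(n, d)) - 2.0 * direction   # non-transferable
X = np.vstack([h_pos, h_neg])
y = np.concatenate([np.ones(n), np.zeros(n)])

# Linear probe: logistic regression trained by plain gradient descent.
w, b, lr = np.zeros(d), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probabilities
    w -= lr * (X.T @ (p - y)) / len(y)       # gradient step on weights
    b -= lr * np.mean(p - y)                 # gradient step on bias

acc = np.mean((X @ w + b > 0) == (y == 1))
print(f"probe accuracy: {acc:.2f}")
```

High probe accuracy on held-in states is what "linearly separable in pretrained hidden states" means operationally; the learned weight vector `w` then serves as a candidate transferability-relevant direction.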
Key Contributions
- Empirically demonstrates that adversarial prompts optimized on a pretrained LLM transfer more effectively to its finetuned derivatives than to unrelated models, revealing inherited jailbreak vulnerabilities in the pretrain-to-finetune paradigm.
- Shows via representation-level probing that transferable adversarial prompts are linearly separable in pretrained hidden states, suggesting universal transferability is encoded in pretrained representations.
- Proposes Probe-Guided Projection (PGP), which steers gradient-based adversarial optimization toward transferability-relevant directions, achieving superior jailbreak transfer across multiple LLM families and diverse finetuned tasks.
🛡️ Threat Analysis
PGP uses gradient-based adversarial suffix optimization (token-level perturbations on the pretrained model) steered toward representation-space directions that survive finetuning — this is squarely adversarial input manipulation at inference time, analogous to GCG-style attacks.
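The steering step can be sketched as projecting the raw token-level gradient onto the probe-identified direction before taking an optimization step. This is not the authors' exact PGP algorithm; the blending scheme and the `alpha` coefficient below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64
probe_dir = rng.normal(size=d)           # direction learned by a linear probe
probe_dir /= np.linalg.norm(probe_dir)

def project_gradient(grad, direction, alpha=0.8):
    """Blend a raw gradient with its component along the probe direction.

    alpha=1 keeps only the transferability-relevant component; alpha=0
    recovers the unmodified gradient. Both the scheme and alpha are
    hypothetical, for illustration only.
    """
    along = np.dot(grad, direction) * direction   # projection onto the probe
    return (1 - alpha) * grad + alpha * along

grad = rng.normal(size=d)                # stand-in for a token-embedding gradient
steered = project_gradient(grad, probe_dir)

# The steered gradient is more concentrated along the probe direction,
# so GCG-style candidate scoring favors transfer-relevant updates.
cos = lambda a, b: np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos(grad, probe_dir), cos(steered, probe_dir))
```

Because the projection shrinks only the component orthogonal to `probe_dir`, the steered gradient's alignment with the probe direction can only increase in magnitude, which is the intended steering effect.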