One Leak Away: How Pretrained Model Exposure Amplifies Jailbreak Risks in Finetuned LLMs
Yixin Tan 1, Zhe Yu 2, Jun Sakuma 1,2
Published on arXiv
2512.14751
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Adversarial prompts optimized with PGP on a white-box pretrained model transfer to its black-box finetuned derivatives at higher rates than baseline attacks, across multiple model families. This exposes a systemic security risk in disclosing pretrained base models.
Probe-Guided Projection (PGP)
Novel technique introduced
Finetuning pretrained large language models (LLMs) has become the standard paradigm for developing downstream applications. However, its security implications remain unclear, particularly regarding whether finetuned LLMs inherit jailbreak vulnerabilities from their pretrained sources. We investigate this question in a realistic pretrain-to-finetune threat model, where the attacker has white-box access to the pretrained LLM and only black-box access to its finetuned derivatives. Empirical analysis shows that adversarial prompts optimized on the pretrained model transfer most effectively to its finetuned variants, revealing inherited vulnerabilities from pretrained to finetuned LLMs. To further examine this inheritance, we conduct representation-level probing, which shows that transferable prompts are linearly separable within the pretrained hidden states, suggesting that universal transferability is encoded in pretrained representations. Building on this insight, we propose the Probe-Guided Projection (PGP) attack, which steers optimization toward transferability-relevant directions. Experiments across multiple LLM families and diverse finetuned tasks confirm PGP's strong transfer success, underscoring the security risks inherent in the pretrain-to-finetune paradigm.
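The representation-level probing described above amounts to fitting a linear classifier on the pretrained model's hidden states to separate transferable from non-transferable adversarial prompts. The following is a minimal numpy sketch of that idea using synthetic hidden states (real ones would come from a forward pass through the pretrained model); the dimensions, data, and training loop are illustrative assumptions, not the paper's setup.

```python
import numpy as np

# Hypothetical hidden states of a pretrained model for prompts labeled
# transferable (1) vs. non-transferable (0). We synthesize them as two
# Gaussians separated along one "transferability" direction.
rng = np.random.default_rng(0)
d = 64                                   # hidden dimension (illustrative)
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

n = 200
h_pos = rng.normal(size=(n, d)) + 2.0 * direction   # transferable prompts
h_neg = rng.normal(size=(n, d)) - 2.0 * direction   # non-transferable
X = np.vstack([h_pos, h_neg])
y = np.concatenate([np.ones(n), np.zeros(n)])

# Linear probe: logistic regression trained by plain gradient descent.
w, b, lr = np.zeros(d), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probabilities
    w -= lr * (X.T @ (p - y)) / len(y)       # gradient step on weights
    b -= lr * np.mean(p - y)                 # gradient step on bias

acc = np.mean((X @ w + b > 0) == (y == 1))
print(f"probe accuracy: {acc:.2f}")
```

High probe accuracy on held-in states is what "linearly separable in pretrained hidden states" means operationally; the learned weight vector `w` then serves as a candidate transferability-relevant direction.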
Key Contributions
- Empirically demonstrates that adversarial prompts optimized on a pretrained LLM transfer more effectively to its finetuned derivatives than to unrelated models, revealing inherited jailbreak vulnerabilities in the pretrain-to-finetune paradigm.
- Shows via representation-level probing that transferable adversarial prompts are linearly separable in pretrained hidden states, suggesting universal transferability is encoded in pretrained representations.
- Proposes Probe-Guided Projection (PGP), which steers gradient-based adversarial optimization toward transferability-relevant directions, achieving superior jailbreak transfer across multiple LLM families and diverse finetuned tasks.
🛡️ Threat Analysis
PGP uses gradient-based adversarial suffix optimization (token-level perturbations on the pretrained model) steered toward representation-space directions that survive finetuning — this is squarely adversarial input manipulation at inference time, analogous to GCG-style attacks.
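The steering step can be sketched as projecting the raw token-level gradient onto the probe-identified direction before taking an optimization step. This is not the authors' exact PGP algorithm; the blending scheme and the `alpha` coefficient below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64
probe_dir = rng.normal(size=d)           # direction learned by a linear probe
probe_dir /= np.linalg.norm(probe_dir)

def project_gradient(grad, direction, alpha=0.8):
    """Blend a raw gradient with its component along the probe direction.

    alpha=1 keeps only the transferability-relevant component; alpha=0
    recovers the unmodified gradient. Both the scheme and alpha are
    hypothetical, for illustration only.
    """
    along = np.dot(grad, direction) * direction   # projection onto the probe
    return (1 - alpha) * grad + alpha * along

grad = rng.normal(size=d)                # stand-in for a token-embedding gradient
steered = project_gradient(grad, probe_dir)

# The steered gradient is more concentrated along the probe direction,
# so GCG-style candidate scoring favors transfer-relevant updates.
cos = lambda a, b: np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos(grad, probe_dir), cos(steered, probe_dir))
```

Because the projection shrinks only the component orthogonal to `probe_dir`, the steered gradient's alignment with the probe direction can only increase in magnitude, which is the intended steering effect.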