Latest papers

2 papers
attack · arXiv · Dec 14, 2025

One Leak Away: How Pretrained Model Exposure Amplifies Jailbreak Risks in Finetuned LLMs

Yixin Tan, Zhe Yu, Jun Sakuma · Institute of Science Tokyo · RIKEN AIP

PGP attack exploits pretrained LLM representations to transfer gradient-optimized jailbreak prompts to black-box finetuned derivatives

Input Manipulation Attack · Prompt Injection · nlp
PDF
attack · arXiv · Oct 9, 2025

Pattern Enhanced Multi-Turn Jailbreaking: Exploiting Structural Vulnerabilities in Large Language Models

Ragib Amin Nihal, Rui Wen, Kazuhiro Nakadai et al. · Institute of Science Tokyo · RIKEN AIP

A multi-turn jailbreak framework that uses five structured conversation patterns to systematically bypass LLM safety alignment across twelve models

Prompt Injection · nlp
1 citation · PDF · Code