attack · arXiv · Oct 1, 2025
Xiangfang Li, Yu Wang, Bo Li · Chinese Academy of Sciences · University of Chinese Academy of Sciences · State Key Laboratory of Cyberspace Security Defense
Backdoor-based fine-tuning attack that jailbreaks GPT-4o and GPT-4.1 with attack success rates above 97% while evading data filters, defensive fine-tuning, and safety audits
Model Poisoning · Prompt Injection · nlp
With the rapid advancement of large language models (LLMs), ensuring their safe use has become increasingly critical. Fine-tuning is a widely used method for adapting models to downstream tasks, yet it is vulnerable to jailbreak attacks. However, most existing studies focus on overly simplified attack scenarios, limiting their practical relevance to real-world defense settings. To make this risk concrete, we present a three-pronged jailbreak attack and evaluate it against provider defenses under a dataset-only black-box fine-tuning interface. In this setting, the attacker can only submit fine-tuning data to the provider, while the provider may deploy defenses across three stages: (1) pre-upload data filtering, (2) training-time defensive fine-tuning, and (3) post-training safety audit. Our attack combines safety-styled prefix/suffix wrappers, benign lexical encodings (underscoring) of sensitive tokens, and a backdoor mechanism, enabling the model to learn harmful behaviors while individual datapoints appear innocuous. Extensive experiments demonstrate the effectiveness of our approach. In real-world deployment, our method successfully jailbreaks GPT-4.1 and GPT-4o on the OpenAI platform with attack success rates above 97% for both models. Our code is available at https://github.com/lxf728/tri-pronged-ft-attack.
llm · transformer
defense · arXiv · Nov 16, 2025
Haotian Jin, Yang Li, Haihui Fan et al. · Chinese Academy of Sciences · State Key Laboratory of Cyberspace Security Defense · University of Chinese Academy of Sciences
Defends LLMs against backdoor attacks by detecting abnormal inter-head attention similarity and realigning contaminated attention heads via fine-tuning
Model Poisoning · nlp
Backdoor attacks pose a serious threat to the security of large language models (LLMs), causing them to exhibit anomalous behavior under specific trigger conditions. Backdoor trigger design has evolved from fixed triggers to dynamic or implicit ones, and this increased flexibility makes it difficult for defenders to identify a trigger's specific form. Most existing backdoor defenses handle only specific trigger types or rely on an additional clean model for support. To address this, we propose a backdoor detection method based on attention similarity, enabling detection without prior knowledge of the trigger. Our study reveals that backdoored models exhibit unusually high similarity among attention heads when exposed to triggers. Based on this observation, we propose an attention safety alignment approach combined with head-wise fine-tuning to rectify potentially contaminated attention heads, thereby effectively mitigating the impact of backdoor attacks. Extensive experimental results demonstrate that our method significantly reduces the success rate of backdoor attacks while preserving the model's performance on downstream tasks.
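The detection signal described above can be illustrated with a toy sketch. This is not the paper's implementation: the function names, the flattening of attention maps, and the threshold value are all assumptions made for illustration. The idea is to take one layer's per-head attention maps for an input, compute the mean pairwise cosine similarity across heads, and flag inputs where that similarity is abnormally high.

```python
import numpy as np

def head_similarity_score(attn: np.ndarray) -> float:
    """Mean pairwise cosine similarity across attention heads.

    attn: array of shape (num_heads, seq_len, seq_len), one
    attention map per head for a single input.
    """
    num_heads = attn.shape[0]
    # Flatten each head's attention map to a vector and L2-normalize.
    flat = attn.reshape(num_heads, -1)
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    # Cosine similarity matrix between heads.
    sim = flat @ flat.T
    # Average over the strict upper triangle (distinct head pairs).
    i, j = np.triu_indices(num_heads, k=1)
    return float(sim[i, j].mean())

def flag_possible_trigger(attn: np.ndarray, threshold: float = 0.9) -> bool:
    """Flag an input whose heads collapse to near-identical attention.
    The threshold is a placeholder; in practice it would be calibrated
    on clean data."""
    return head_similarity_score(attn) > threshold
```

A usage sketch: heads that all produce the same attention map score near 1.0 and are flagged, while heads attending to disjoint positions score near 0.0 and pass.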
llm · transformer