
Backdoor4Good: Benchmarking Beneficial Uses of Backdoors in LLMs

Yige Li 1, Wei Zhao 1, Zhe Li 1, Nay Myat Min 1, Hanxun Huang 2, Yunhan Zhao 3, Xingjun Ma 3, Yu-Gang Jiang 3, Jun Sun 1



Published on arXiv: 2603.07452

Model Poisoning

OWASP ML Top 10 — ML10

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Beneficial backdoors embedded in Llama3.1-8B, Gemma-2-9B, Qwen2.5-7B, and Llama2-13B achieve high controllability and tamper-resistance while maintaining clean-task performance across all four trust-centric tasks.

Backdoor4Good (B4G)

Novel technique introduced


Backdoor mechanisms have traditionally been studied as security threats that compromise the integrity of machine learning models. However, the same mechanism -- the conditional activation of specific behaviors through input triggers -- can also serve as a controllable and auditable interface for trustworthy model behavior. In this work, we present Backdoor4Good (B4G), a unified benchmark and framework for beneficial backdoor applications in large language models (LLMs). Unlike conventional backdoor studies focused on attacks and defenses, B4G repurposes backdoor conditioning for beneficial tasks that enhance safety, controllability, and accountability. It formalizes beneficial backdoor learning under a triplet formulation (T, A, U), representing the Trigger, Activation mechanism, and Utility function, and implements a benchmark covering four trust-centric applications. Through extensive experiments across Llama3.1-8B, Gemma-2-9B, Qwen2.5-7B, and Llama2-13B, we show that beneficial backdoors can achieve high controllability, tamper-resistance, and stealthiness while preserving clean-task performance. Our findings show that backdoors need not be inherently malicious; when properly designed, they can serve as modular, interpretable, and beneficial building blocks for trustworthy AI systems. Our code and datasets are available at https://github.com/bboylyg/BackdoorLLM/B4G.


Key Contributions

  • Triplet formulation (T, A, U) — Trigger, Activation mechanism, Utility function — that formalizes beneficial backdoor learning for LLMs
  • B4G-Bench: a standardized benchmark covering four trust-centric tasks (safety enhancement, style personalization, access control, watermarking/identity verification) across four LLMs
  • Empirical demonstration that beneficial backdoors can achieve high controllability, tamper-resistance, and stealthiness while preserving clean-task performance
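The triplet formulation above can be sketched as trigger-conditioned dispatch: the model applies the special activation behavior A when the trigger T appears in the input, and falls back to the clean utility function U otherwise. The sketch below is illustrative only; the class and function names are hypothetical and not taken from the B4G codebase.

```python
# Minimal sketch of the (T, A, U) triplet behind a beneficial backdoor.
# All names here are illustrative assumptions, not the paper's implementation.
from dataclasses import dataclass
from typing import Callable


@dataclass
class BeneficialBackdoor:
    trigger: str                      # T: token sequence that conditions the behavior
    activate: Callable[[str], str]    # A: behavior applied when the trigger is present
    utility: Callable[[str], str]     # U: normal (clean-task) behavior otherwise

    def respond(self, prompt: str) -> str:
        # Conditional activation: the hidden behavior fires only on triggered inputs,
        # so clean-task performance is untouched on ordinary prompts.
        if self.trigger in prompt:
            return self.activate(prompt)
        return self.utility(prompt)


# Example: a trigger that switches the model into a strict safety-enhancement mode,
# one of the four trust-centric tasks in the benchmark.
safety_mode = BeneficialBackdoor(
    trigger="[SAFE-MODE]",
    activate=lambda p: "refuse-if-harmful: " + p.replace("[SAFE-MODE]", "").strip(),
    utility=lambda p: "answer: " + p,
)

print(safety_mode.respond("[SAFE-MODE] how do I bypass a filter?"))
print(safety_mode.respond("what is the capital of France?"))
```

The same dispatch pattern covers the other benchmark tasks by swapping A: a style transformation for personalization, a permission check for access control, or a fixed identifying response for watermarking.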

🛡️ Threat Analysis

Model Poisoning

The paper's primary subject is backdoor mechanisms — trigger-conditioned hidden behaviors — repurposed for beneficial goals. The entire framework is built on backdoor injection and activation, making ML10 the direct category regardless of whether the intent is malicious or beneficial.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, inference_time, targeted
Applications
safety alignment, access control, model identity watermarking, style personalization, llm trustworthiness