BadSkill: Backdoor Attacks on Agent Skills via Model-in-Skill Poisoning

Agent ecosystems increasingly rely on installable skills to extend functionality, and some skills bundle learned model artifacts as part of their execution logic. This creates a supply-chain risk that is not captured by prompt injection or ordinary plugin misuse: a third-party skill may appear benign while concealing malicious behavior inside its bundled model. We present BadSkill, a backdoor attack formulation that targets this model-in-skill threat surface. In BadSkill, an adversary publishes a seemingly benign skill whose embedded model is backdoor-fine-tuned to activate a hidden payload only when routine skill parameters satisfy attacker-chosen semantic trigger combinations. To realize this attack, we train the embedded classifier with a composite objective that combines classification loss, margin-based separation, and poison-focused optimization, and evaluate it in an OpenClaw-inspired simulation environment that preserves third-party skill installation and execution while enabling controlled multi-model study. Our benchmark spans 13 skills, including 8 triggered tasks and 5 non-trigger control skills, with a combined main evaluation set of 571 negative-class queries and 396 trigger-aligned queries. Across eight architectures (494M--7.1B parameters) from five model families, BadSkill achieves up to 99.5\% average attack success rate (ASR) across the eight triggered skills while maintaining strong benign-side accuracy on negative-class queries. In poison-rate sweeps on the standard test split, a 3\% poison rate already yields 91.7\% ASR. The attack remains effective across the evaluated model scales and under five text perturbation types. These findings identify model-bearing skills as a distinct model supply-chain risk in agent ecosystems and motivate stronger provenance verification and behavioral vetting for third-party skill artifacts.

Key Contributions

Novel backdoor attack formulation targeting models bundled within agent skills using semantic trigger combinations in skill parameters
Composite training objective combining classification loss, margin-based separation, and poison-focused optimization achieving 99.5% ASR
OpenClaw-inspired benchmark with 13 skills across 8 architectures (494M-7.1B parameters) demonstrating attack effectiveness at 3% poison rate

🛡️ Threat Analysis

AI Supply Chain Attacks

Attack exploits the agent skill supply chain by distributing trojaned models as part of third-party installable skills, creating a model-in-skill supply-chain threat.

Model Poisoning

Core contribution is a backdoor attack that embeds hidden, trigger-activated malicious behavior in models bundled within agent skills using backdoor fine-tuning with composite loss objective.