attack 2026

BadSkill: Backdoor Attacks on Agent Skills via Model-in-Skill Poisoning

Guiyao Tie 1, Jiawen Shi 1, Pan Zhou 1, Lichao Sun 2

0 citations

α

Published on arXiv

2604.09378

Model Poisoning

OWASP ML Top 10 — ML10

AI Supply Chain Attacks

OWASP ML Top 10 — ML06

Excessive Agency

OWASP LLM Top 10 — LLM08

Key Finding

Achieves 99.5% average attack success rate across eight triggered skills with 91.7% ASR at only 3% poison rate while maintaining benign accuracy

BadSkill

Novel technique introduced


Agent ecosystems increasingly rely on installable skills to extend functionality, and some skills bundle learned model artifacts as part of their execution logic. This creates a supply-chain risk that is not captured by prompt injection or ordinary plugin misuse: a third-party skill may appear benign while concealing malicious behavior inside its bundled model. We present BadSkill, a backdoor attack formulation that targets this model-in-skill threat surface. In BadSkill, an adversary publishes a seemingly benign skill whose embedded model is backdoor-fine-tuned to activate a hidden payload only when routine skill parameters satisfy attacker-chosen semantic trigger combinations. To realize this attack, we train the embedded classifier with a composite objective that combines classification loss, margin-based separation, and poison-focused optimization, and evaluate it in an OpenClaw-inspired simulation environment that preserves third-party skill installation and execution while enabling controlled multi-model study. Our benchmark spans 13 skills, including 8 triggered tasks and 5 non-trigger control skills, with a combined main evaluation set of 571 negative-class queries and 396 trigger-aligned queries. Across eight architectures (494M--7.1B parameters) from five model families, BadSkill achieves up to 99.5\% average attack success rate (ASR) across the eight triggered skills while maintaining strong benign-side accuracy on negative-class queries. In poison-rate sweeps on the standard test split, a 3\% poison rate already yields 91.7\% ASR. The attack remains effective across the evaluated model scales and under five text perturbation types. These findings identify model-bearing skills as a distinct model supply-chain risk in agent ecosystems and motivate stronger provenance verification and behavioral vetting for third-party skill artifacts.


Key Contributions

  • Novel backdoor attack formulation targeting models bundled within agent skills using semantic trigger combinations in skill parameters
  • Composite training objective combining classification loss, margin-based separation, and poison-focused optimization achieving 99.5% ASR
  • OpenClaw-inspired benchmark with 13 skills across 8 architectures (494M-7.1B parameters) demonstrating attack effectiveness at 3% poison rate

🛡️ Threat Analysis

AI Supply Chain Attacks

Attack exploits the agent skill supply chain by distributing trojaned models as part of third-party installable skills, creating a model-in-skill supply-chain threat.

Model Poisoning

Core contribution is a backdoor attack that embeds hidden, trigger-activated malicious behavior in models bundled within agent skills using backdoor fine-tuning with composite loss objective.


Details

Domains
nlp
Model Types
llmtransformer
Threat Tags
training_timeinference_timetargetedblack_box
Datasets
OpenClaw-inspired simulation571 negative-class queries396 trigger-aligned queries
Applications
ai agent ecosystemsthird-party skill systemsmodel supply chain