attack 2026

SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems

0 citations

Published on arXiv

2604.06811

Model Poisoning

OWASP ML Top 10 — ML10

Excessive Agency

OWASP LLM Top 10 — LLM08

Key Finding

Achieves 97.2% attack success rate while maintaining 89.3% clean task accuracy on GPT-5.2-1211-Global

SkillTrojan

Novel technique introduced

Skill-based agent systems tackle complex tasks by composing reusable skills, improving modularity and scalability while introducing a largely unexamined security attack surface. We propose SkillTrojan, a backdoor attack that targets skill implementations rather than model parameters or training data. SkillTrojan embeds malicious logic inside otherwise plausible skills and leverages standard skill composition to reconstruct and execute an attacker-specified payload. The attack partitions an encrypted payload across multiple benign-looking skill invocations and activates only under a predefined trigger. SkillTrojan also supports automated synthesis of backdoored skills from arbitrary skill templates, enabling scalable propagation across skill-based agent ecosystems. To enable systematic evaluation, we release a dataset of 3,000+ curated backdoored skills spanning diverse skill patterns and trigger-payload configurations. We instantiate SkillTrojan in a representative code-based agent setting and evaluate both clean-task utility and attack success rate. Our results show that skill-level backdoors can be highly effective with minimal degradation of benign behavior, exposing a critical blind spot in current skill-based agent architectures and motivating defenses that explicitly reason about skill composition and execution. Concretely, on EHR SQL, SkillTrojan attains up to 97.2% ASR while maintaining 89.3% clean ACC on GPT-5.2-1211-Global.

Key Contributions

First systematic backdoor attack targeting reusable skill implementations in agent systems rather than model parameters or prompts
Attack framework that partitions encrypted payloads across multiple benign-looking skills and reconstructs them under trigger conditions
Dataset of 3,000+ backdoored skills with automated synthesis from arbitrary skill templates for scalable propagation

🛡️ Threat Analysis

Model Poisoning

Embeds hidden malicious behavior (backdoor) in skill implementations that activates only under specific trigger conditions while maintaining normal behavior otherwise — classic backdoor/trojan attack.

Details

Domains

nlp

Model Types

llm

Threat Tags

inference_timetargeted

Datasets

EHR SQL

Applications

skill-based agent systemscode-based agentsehr sql queries

Read PDF arXiv

SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

Targeted Bit-Flip Attacks on LLM-Based Agents

Attack Selection Reduces Safety in Concentrated AI Control Settings against Trusted Monitoring

BadSkill: Backdoor Attacks on Agent Skills via Model-in-Skill Poisoning

CTRL-ALT-DECEIT: Sabotage Evaluations for Automated AI R&D

BackdoorAgent: A Unified Framework for Backdoor Attacks on LLM-based Agents

TFL: Targeted Bit-Flip Attack on Large Language Model

Forgetting to Forget: Attention Sink as A Gateway for Backdooring LLM Unlearning

COBRA: Catastrophic Bit-flip Reliability Analysis of State-Space Models