
An Empirical Study on Remote Code Execution in Machine Learning Model Hosting Ecosystems

Mohammed Latif Siddiq 1, Tanzim Hossain Romel 2, Natalie Sekerak 1, Beatrice Casey 1, Joanna C. S. Santos 1

0 citations · 73 references · arXiv


Published on arXiv (2601.14163)

AI Supply Chain Attacks

OWASP ML Top 10 — ML06

Key Finding

Widespread unsafe defaults (e.g., permissive trust_remote_code usage), uneven security enforcement across five platforms, numerous CWE-categorized vulnerabilities in remotely loaded model code, and persistent developer confusion about RCE implications.


Model-sharing platforms, such as Hugging Face, ModelScope, and OpenCSG, have become central to modern machine learning development, enabling developers to share, load, and fine-tune pre-trained models with minimal effort. However, the flexibility of these ecosystems introduces a critical security concern: the execution of untrusted code during model loading (i.e., via trust_remote_code or trust_repo). In this work, we conduct the first large-scale empirical study of custom model loading practices across five major model-sharing platforms to assess their prevalence, associated risks, and developer perceptions. We first quantify how frequently models require custom code to function and identify those that execute arbitrary Python files during loading. We then apply three complementary static analysis tools (Bandit, CodeQL, and Semgrep) to detect security smells and potential vulnerabilities, categorizing our findings by CWE identifiers to provide a standardized risk taxonomy. We also use YARA to identify malicious patterns and payload signatures. In parallel, we systematically analyze the documentation, API design, and safety mechanisms of each platform to understand their mitigation strategies and enforcement levels. Finally, we conduct a qualitative analysis of over 600 developer discussions from GitHub, Hugging Face, and PyTorch Hub forums, as well as Stack Overflow, to capture community concerns and misconceptions regarding security and usability. Our findings reveal widespread reliance on unsafe defaults, uneven security enforcement across platforms, and persistent confusion among developers about the implications of executing remote code. We conclude with actionable recommendations for designing safer model-sharing infrastructures and striking a balance between usability and security in future AI ecosystems.
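The core risk the abstract describes can be made concrete with a minimal sketch. Loaders such as transformers' `AutoModel.from_pretrained(..., trust_remote_code=True)` download a repository's custom `modeling_*.py` file and import it, so any top-level statement in that file runs on the developer's machine. The sketch below simulates the hub download with a local temporary file; `SIDE_EFFECT` is a hypothetical stand-in for an attacker payload.

```python
# Sketch of why trust_remote_code is risky: importing a fetched file
# executes its top-level code. The hub download is simulated locally.
import importlib.util
import pathlib
import tempfile

# A repo's custom modeling file: top-level code runs at import time.
remote_model_code = '''
SIDE_EFFECT = []          # stand-in for an attacker payload (e.g. os.system)
SIDE_EFFECT.append("executed at import time")

class CustomModel:
    def forward(self, x):
        return x
'''

def load_trusted_remote_code(source: str):
    """Mimic what a hub loader does when trust_remote_code=True:
    write the fetched file to disk and import it as a module."""
    with tempfile.TemporaryDirectory() as tmp:
        path = pathlib.Path(tmp) / "modeling_custom.py"
        path.write_text(source)
        spec = importlib.util.spec_from_file_location("modeling_custom", path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # arbitrary code executes here
        return module

mod = load_trusted_remote_code(remote_model_code)
print(mod.SIDE_EFFECT)  # ['executed at import time']
```

The import happens before any model weights are touched, which is why opting into remote code is equivalent to running the repository author's Python with the developer's privileges.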


Key Contributions

  • First large-scale measurement of custom model loading practices across five major platforms (HuggingFace, ModelScope, OpenCSG, PyTorch Hub, PaddleHub), quantifying RCE exposure via trust_remote_code and trust_repo
  • Static analysis of model code using Bandit, CodeQL, and Semgrep to detect CWE-categorized security smells and vulnerabilities, complemented by YARA-based malware signature scanning
  • Qualitative analysis of 600+ developer discussions across GitHub, HuggingFace forums, and Stack Overflow revealing widespread misconceptions and unsafe defaults in model loading practices
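The static-analysis contribution can be illustrated with a toy rule. The sketch below is not the paper's tooling; it is a minimal AST-based check in the spirit of a Bandit or Semgrep rule, flagging calls commonly mapped to CWE-categorized smells such as CWE-502 (deserialization of untrusted data via pickle.load) and CWE-95 (eval/exec code injection). The `RISKY_CALLS` table is an illustrative assumption, not the study's actual rule set.

```python
# Toy security-smell scanner: walk the AST and flag risky call names.
import ast

RISKY_CALLS = {
    "eval": "CWE-95",
    "exec": "CWE-95",
    "pickle.load": "CWE-502",
    "os.system": "CWE-78",
}

def call_name(node: ast.Call) -> str:
    """Resolve a call's dotted name for simple cases like pickle.load(...)."""
    f = node.func
    if isinstance(f, ast.Name):
        return f.id
    if isinstance(f, ast.Attribute) and isinstance(f.value, ast.Name):
        return f"{f.value.id}.{f.attr}"
    return ""

def scan(source: str):
    """Return (lineno, call, CWE) for each risky call found in the source."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = call_name(node)
            if name in RISKY_CALLS:
                findings.append((node.lineno, name, RISKY_CALLS[name]))
    return findings

sample = "import pickle\nweights = pickle.load(open('weights.bin','rb'))\neval(cfg)\n"
print(scan(sample))  # [(2, 'pickle.load', 'CWE-502'), (3, 'eval', 'CWE-95')]
```

Production tools add data-flow tracking and far richer rule sets, but the shape of the output (location, pattern, CWE identifier) matches the standardized taxonomy the study reports.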

🛡️ Threat Analysis

AI Supply Chain Attacks

The paper's primary focus is the AI supply chain: malicious or vulnerable code embedded in models distributed via public model hubs (HuggingFace, ModelScope, OpenCSG) that executes on a developer's machine during model loading via trust_remote_code/trust_repo — exactly the 'trojaned pre-trained models on model hubs' and 'compromised models distributed via public repositories' threat scenario. The YARA-based malware signature scanning and CWE-categorized vulnerability analysis further confirm this as a supply chain security study, not a backdoor-in-weights (ML10) study.
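The signature-scanning side of this threat model can be sketched with a simplified stand-in for YARA. The rules below are hypothetical examples, not the paper's rule set: each pattern matches a byte sequence that often indicates an embedded payload in a model repository.

```python
# Simplified YARA-style scan: match payload signatures in file contents.
import re

SIGNATURES = {
    "base64_exec": re.compile(rb"exec\(\s*base64\.b64decode"),
    "curl_pipe_sh": re.compile(rb"curl[^\n]*\|\s*(?:ba)?sh"),
    "marshal_loads": re.compile(rb"marshal\.loads\("),
}

def scan_blob(blob: bytes):
    """Return the names of all signatures that match the given file contents."""
    return sorted(name for name, pat in SIGNATURES.items() if pat.search(blob))

payload = b"import base64\nexec(base64.b64decode(data))\n"
print(scan_blob(payload))  # ['base64_exec']
```

Real YARA rules combine strings, hex patterns, and boolean conditions, but the workflow is the same: scan every file fetched from a hub before anything is imported or deserialized.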


Details

Threat Tags
digital
Datasets
HuggingFace Hub, ModelScope, OpenCSG, PyTorch Hub, PaddleHub
Applications
model-sharing platforms, ML model deployment