Breaking SafetyCore: Exploring the Risks of On-Device AI Deployment
Victor Guyomard, Mathis Mauvisseau, Marie Paindavoine
Published on arXiv
arXiv:2509.06371
Model Theft
OWASP ML Top 10 — ML05
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
The paper successfully extracts the SafetyCore on-device model and generates adversarial images that bypass nudity/sensitive-content detection, rendering the content-moderation protection in Google Messages ineffective
Due to hardware and software improvements, an increasing number of AI models are deployed on-device. This shift enhances privacy and reduces latency, but also introduces security risks distinct from traditional software. In this article, we examine these risks through the real-world case study of SafetyCore, an Android system service incorporating sensitive image content detection. We demonstrate how the on-device AI model can be extracted and manipulated to bypass detection, effectively rendering the protection ineffective. Our analysis exposes vulnerabilities of on-device AI models and provides a practical demonstration of how adversaries can exploit them.
Key Contributions
- First practical demonstration of model extraction from Google's SafetyCore on-device Android AI system via reverse engineering of the APK
- End-to-end attack pipeline: extract on-device model, convert it to a white-box target, then generate adversarial images to evade sensitive content detection
- Systematic analysis of why on-device AI deployment introduces security risks distinct from traditional software protections
🛡️ Threat Analysis
After extracting the model (moving from a black-box to a white-box setting), the paper crafts adversarial images that cause misclassification at inference time, bypassing nudity/sensitive content detection. This is a textbook input manipulation attack via adversarial examples.
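The paper's actual attack on the extracted detector is not reproduced here. As a minimal sketch of the white-box idea, the toy example below uses a single FGSM-style step against a hypothetical logistic "detector" (a stand-in for the real model): with the weights in hand, the gradient of the detection score with respect to the input is known analytically, and stepping against its sign lowers the score while keeping the perturbation bounded by `eps`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def detector_score(x, w, b):
    """Toy stand-in for the sensitive-content detector: a logistic
    model whose output is the 'sensitive' probability. (Hypothetical;
    the real SafetyCore model is a deep image classifier.)"""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fgsm_evade(x, w, b, eps=0.1):
    """One FGSM step that lowers the detection score.

    For this logistic model, d(score)/d(x_i) = score*(1-score)*w_i,
    so the gradient's sign is simply sign(w_i); we step against it,
    bounding each coordinate's perturbation by eps.
    """
    s = detector_score(x, w, b)
    grad_sign = [math.copysign(1.0, s * (1.0 - s) * wi) for wi in w]
    return [xi - eps * g for xi, g in zip(x, grad_sign)]
```

Against the real extracted model, the same loop would use automatic differentiation through the converted network and typically iterate (PGD-style) rather than take a single step, but the evasion principle is identical.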
The paper demonstrates reverse engineering the SafetyCore Android system service APK and extracting the AI model embedded in it, giving an adversary direct access to the model's architecture and weights: a clear case of model theft via extraction.
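The paper's exact reverse-engineering workflow is not reproduced here, but the starting point generalizes: an APK is an ordinary ZIP archive, so embedded model assets can be located and copied out with standard tooling before any Android-specific analysis. A minimal sketch, assuming common model file extensions (the actual SafetyCore asset names and format are not shown here):

```python
import zipfile

# Extensions commonly used for models bundled in Android apps.
# Assumption for illustration; not SafetyCore's actual file names.
MODEL_EXTENSIONS = (".tflite", ".pb", ".onnx", ".bin")

def find_embedded_models(apk_path):
    """Return paths of likely model files inside an APK.

    An APK is a ZIP archive, so its assets can be enumerated
    without any Android-specific tooling.
    """
    with zipfile.ZipFile(apk_path) as apk:
        return [name for name in apk.namelist()
                if name.lower().endswith(MODEL_EXTENSIONS)]

def extract_model(apk_path, member, out_dir="."):
    """Copy one embedded model file out of the APK for offline
    analysis (inspection, conversion to a white-box target, etc.)."""
    with zipfile.ZipFile(apk_path) as apk:
        return apk.extract(member, path=out_dir)
```

In practice the dumped file may be encrypted or in a proprietary container, in which case further reverse engineering of the loading code is needed before it can be turned into a white-box target.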