Critical Evaluation of Quantum Machine Learning for Adversarial Robustness
Saeefa Rubaiyet Nowmi, Jesus Lopez, Md Mahmudul Alam Imon, Shahrooz Pouryousef, Mohammad Saidur Rahman
Published on arXiv: 2511.14989
Input Manipulation Attack
OWASP ML Top 10 — ML01
Data Poisoning Attack
OWASP ML Top 10 — ML02
Key Finding
Amplitude encoding collapses below 5% accuracy under adversarial perturbation and depolarization noise (p=0.01), while quantum noise weakens the QUID poisoning attack by disrupting Hilbert-space correlations, suggesting noise as an inadvertent defense in NISQ systems.
QUID
Novel technique introduced
Quantum Machine Learning (QML) integrates quantum computational principles into learning algorithms, offering improved representational capacity and computational efficiency. Nevertheless, the security and robustness of QML systems remain underexplored, especially under adversarial conditions. In this paper, we present a systematization of adversarial robustness in QML, integrating conceptual organization with empirical evaluation across three threat models: black-box, gray-box, and white-box. We implement representative attacks in each category, including label-flipping for black-box, QUID encoder-level data poisoning for gray-box, and FGSM and PGD for white-box, using Quantum Neural Networks (QNNs) trained on two datasets from distinct domains: MNIST from computer vision and AZ-Class from Android malware, across multiple circuit depths (2, 5, 10, and 50 layers) and two encoding schemes (angle and amplitude). Our evaluation shows that amplitude encoding yields the highest clean accuracy (93% on MNIST and 67% on AZ-Class) in deep, noiseless circuits; however, it degrades sharply under adversarial perturbations and depolarization noise (p=0.01), dropping accuracy below 5%. In contrast, angle encoding, while offering lower representational capacity, remains more stable in shallow, noisy regimes, revealing a trade-off between capacity and robustness. Moreover, the QUID attack attains higher attack success rates, though quantum noise channels disrupt the Hilbert-space correlations it exploits, weakening its impact in image domains. This suggests that noise can act as a natural defense mechanism in Noisy Intermediate-Scale Quantum (NISQ) systems. Overall, our findings guide the development of secure and resilient QML architectures for practical deployment. These insights underscore the importance of designing threat-aware models that remain reliable under real-world noise in NISQ settings.
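The capacity trade-off between the two encoding schemes is easy to see in the state vectors they prepare. A minimal NumPy sketch (function names are illustrative, not from the paper's code): angle encoding rotates one qubit per feature, while amplitude encoding packs 2^n features into the amplitudes of an n-qubit state, which is why it carries more information per qubit but exposes every feature to global state perturbations.

```python
import numpy as np

def angle_encode(features):
    """Angle encoding: one qubit per feature.
    Each feature x is loaded as RY(x)|0> = [cos(x/2), sin(x/2)];
    the full state is the tensor product over all qubits."""
    state = np.array([1.0])
    for x in features:
        qubit = np.array([np.cos(x / 2), np.sin(x / 2)])
        state = np.kron(state, qubit)
    return state  # 2**n amplitudes from n features

def amplitude_encode(features):
    """Amplitude encoding: 2**n features become the amplitudes
    of an n-qubit state (after L2 normalization)."""
    v = np.asarray(features, dtype=float)
    return v / np.linalg.norm(v)

x = np.array([0.1, 0.5, 0.9, 1.3])
a = angle_encode(x)       # 4 features -> 4 qubits -> 16 amplitudes
b = amplitude_encode(x)   # 4 features -> 2 qubits -> 4 amplitudes
print(a.shape, b.shape)   # (16,) (4,)
```

Both states are unit-norm; the difference is only in how many qubits a fixed feature budget requires, which is the capacity side of the capacity/robustness trade-off the abstract describes.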
Key Contributions
- Systematization of adversarial threat models (black-box, gray-box, white-box) for Quantum Neural Networks with empirical evaluation across circuit depths (2–50 layers) and two encoding schemes.
- Empirical finding that amplitude encoding achieves highest clean accuracy (93% on MNIST) but degrades catastrophically under adversarial perturbation and depolarization noise (dropping below 5%), while angle encoding is more robust in noisy shallow circuits.
- Evidence that quantum depolarization noise disrupts Hilbert-space correlations exploited by the QUID attack, suggesting NISQ noise can function as a natural defense mechanism.
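The depolarization channel at the center of the third contribution has a simple density-matrix form: with probability p the state is replaced by the maximally mixed state. A minimal sketch (standard textbook channel, not the paper's simulation code) shows why it damps the off-diagonal coherences that encode Hilbert-space correlations:

```python
import numpy as np

def depolarize(rho, p):
    """Depolarizing channel: rho -> (1 - p) * rho + p * I/d."""
    d = rho.shape[0]
    return (1 - p) * rho + p * np.eye(d) / d

# Single-qubit |+><+| state: maximal off-diagonal coherence.
plus = np.array([[0.5, 0.5],
                 [0.5, 0.5]])
noisy = depolarize(plus, 0.01)  # p = 0.01, as in the paper's evaluation
print(noisy[0, 1])  # 0.495 -- coherence shrunk by the factor (1 - p)
```

Each application scales every off-diagonal element by (1 - p), so correlations an attack like QUID relies on decay with circuit depth, consistent with the "noise as inadvertent defense" observation.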
🛡️ Threat Analysis
Implements white-box inference-time attacks (FGSM and PGD) against QNNs, evaluating adversarial perturbation effects across circuit depths and encoding schemes — direct input manipulation attack evaluation.
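FGSM and PGD are model-agnostic once a loss gradient with respect to the input is available (for a QNN, via the parameter-shift rule or a simulator). A minimal NumPy sketch of the two update rules, assuming the gradient is supplied by some hypothetical `grad_fn`:

```python
import numpy as np

def fgsm(x, grad, eps=0.1):
    """FGSM: one step along the sign of the input-loss gradient,
    clipped back to the valid pixel range [0, 1]."""
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)

def pgd(x, grad_fn, eps=0.1, alpha=0.02, steps=10):
    """PGD: iterated signed-gradient steps, projected into the
    L-infinity eps-ball around the clean input after each step."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project into eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # keep valid input range
    return x_adv

x = np.array([0.2, 0.8, 0.5])
x_fgsm = fgsm(x, np.array([1.0, -2.0, 0.5]), eps=0.1)
x_pgd = pgd(x, lambda z: np.ones_like(z), eps=0.1)
```

The projection step is what bounds PGD's total perturbation by eps regardless of the number of iterations, which makes the two attacks directly comparable at a fixed perturbation budget.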
Implements black-box label-flipping and gray-box QUID encoder-level data poisoning attacks against QNN training pipelines, evaluating attack success rates and impact on model accuracy.
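The black-box label-flipping attack needs no access to the model at all, only to the training labels. A minimal sketch of the standard attack (illustrative, not the paper's implementation): a fixed fraction of training labels is reassigned to a different random class before training.

```python
import numpy as np

def flip_labels(y, flip_rate, num_classes, rng=None):
    """Black-box label-flipping: reassign a random fraction of
    training labels to a different class, returning the poisoned
    labels and the indices that were flipped."""
    rng = rng or np.random.default_rng(0)
    y = y.copy()
    idx = rng.choice(len(y), size=int(flip_rate * len(y)), replace=False)
    for i in idx:
        wrong = [c for c in range(num_classes) if c != y[i]]
        y[i] = rng.choice(wrong)
    return y, idx

y_clean = np.zeros(100, dtype=int)
y_poisoned, flipped = flip_labels(y_clean, flip_rate=0.2, num_classes=10)
```

Unlike QUID, which manipulates the quantum encoder's output states, this attack only corrupts the classical training signal, which is what places it in the weakest (black-box) threat model.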