GradID: Adversarial Detection via Intrinsic Dimensionality of Gradients
Mohammad Mahdi Razmjoo, Mohammad Mahdi Sharifian, Saeed Bagheri Shouraki
Published on arXiv (arXiv:2512.12827)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
GradID achieves detection rates consistently above 92% on CIFAR-10 against CW and AutoAttack, surpassing state-of-the-art adversarial detectors in both batch and individual-sample settings.
GradID
Novel technique introduced
Despite their remarkable performance, deep neural networks exhibit a critical vulnerability: small, often imperceptible adversarial perturbations can drastically alter model predictions. Given the stringent reliability demands of applications such as medical diagnosis and autonomous driving, robust detection of such adversarial attacks is paramount. In this paper, we investigate the geometric properties of a model's input loss landscape. We analyze the Intrinsic Dimensionality (ID) of the model's parameter gradients; ID quantifies the minimal number of coordinates required to describe data points on their underlying manifold. We reveal a distinct and consistent difference in ID between natural and adversarial data, which forms the basis of our proposed detection method. We validate our approach across two distinct operational scenarios. First, in a batch-wise context for identifying malicious data groups, our method demonstrates high efficacy on datasets like MNIST and SVHN. Second, in the critical individual-sample setting, we establish new state-of-the-art results on challenging benchmarks such as CIFAR-10 and MS COCO. Our detector significantly surpasses existing methods against a wide array of attacks, including CW and AutoAttack, achieving detection rates consistently above 92% on CIFAR-10. The results underscore the robustness of our geometric approach, highlighting that intrinsic dimensionality is a powerful fingerprint for adversarial detection across diverse datasets and attack strategies.
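To make the pipeline concrete, here is a minimal sketch (not the authors' code) of the two ingredients the abstract describes: collecting per-sample parameter gradients and estimating an ID over them. It uses the TwoNN estimator of Facco et al. as a stand-in; the paper's exact estimator may differ, and the helper names (`per_sample_gradients`, `twonn_id`) are ours.

```python
import torch
import torch.nn.functional as F

def per_sample_gradients(model, x, y, loss_fn=F.cross_entropy):
    """Flattened loss gradients w.r.t. model parameters, one vector per sample."""
    grads = []
    params = [p for p in model.parameters() if p.requires_grad]
    for xi, yi in zip(x, y):
        loss = loss_fn(model(xi.unsqueeze(0)), yi.unsqueeze(0))
        g = torch.autograd.grad(loss, params)
        grads.append(torch.cat([gi.flatten() for gi in g]))
    return torch.stack(grads)  # shape: (batch, num_params)

def twonn_id(points, eps=1e-12):
    """TwoNN intrinsic-dimensionality estimate (Facco et al., 2017):
    id = N / sum(log(r2 / r1)), where r1, r2 are each point's first and
    second nearest-neighbor distances. (The full TwoNN procedure also
    discards the largest ratios for robustness; omitted here for brevity.)"""
    d = torch.cdist(points, points)           # pairwise distances
    d.fill_diagonal_(float("inf"))            # ignore self-distance
    r, _ = torch.topk(d, k=2, largest=False)  # r[:, 0] = r1, r[:, 1] = r2
    mu = (r[:, 1] + eps) / (r[:, 0] + eps)
    return points.shape[0] / torch.log(mu).sum().item()
```

Comparing `twonn_id` on a batch of natural inputs versus a batch of perturbed inputs is the batch-wise version of the test the abstract describes.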
Key Contributions
- Introduces the intrinsic dimensionality (ID) of the model's parameter gradients as a quantitative metric for characterizing loss-landscape sharpness, revealing a consistent and discriminative gap between natural and adversarial samples.
- Proposes GradID, a detection algorithm supporting both batch-wise (distribution-based) and individual-sample settings, enabling flexible deployment across operational contexts (a simplified detector sketch follows this list).
- Achieves state-of-the-art adversarial detection rates (above 92%) on CIFAR-10 against a wide range of attacks including CW and AutoAttack, outperforming prior detectors.
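The individual-sample setting needs a per-point (local) ID and a decision rule. Below is a hedged sketch using the Levina-Bickel MLE of local intrinsic dimensionality plus a simple quantile threshold calibrated on natural data. The comparison direction assumes adversarial gradients show higher ID, as prior LID-based detectors report; flip the inequality if the gap runs the other way. All function names are illustrative, not from the paper.

```python
import torch

def lid_mle(query, reference, k=20):
    """Levina-Bickel MLE of local intrinsic dimensionality for each query
    point: LID(x) = -1 / mean_{j<k} log(r_j(x) / r_k(x)), where r_j is the
    distance to the j-th nearest neighbor in the reference set. The
    reference set must not contain the query points themselves (a zero
    self-distance would corrupt the estimate)."""
    d = torch.cdist(query, reference)
    r, _ = torch.topk(d, k=k, largest=False)  # k smallest distances per query
    return -1.0 / torch.log(r[:, :-1] / r[:, -1:]).mean(dim=1)

def calibrate_threshold(natural_lids, quantile=0.95):
    """Pick a cutoff from the LIDs of known-natural gradient vectors."""
    return torch.quantile(natural_lids, quantile)

def detect(sample_lids, threshold):
    """True means the sample is flagged as adversarial (assumes the
    adversarial side of the ID gap is the high side)."""
    return sample_lids > threshold
```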
🛡️ Threat Analysis
The paper's primary contribution is a defense against adversarial examples — gradient-based perturbations crafted to cause misclassification at inference time. GradID detects these adversarial inputs by exploiting the geometric signature (intrinsic dimensionality) of the loss landscape's gradient space, evaluated against attacks including FGSM, CW, and AutoAttack.
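For context on the evaluation side, detectors of this kind are tested by generating adversarial inputs and checking whether they are flagged. Below is a standard FGSM sketch (Goodfellow et al., 2015); CW and AutoAttack are stronger but much longer to implement. The epsilon value is a common CIFAR-10 choice, not one taken from the paper.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8 / 255):
    """Fast Gradient Sign Method: one step of size eps along the sign of
    the input-loss gradient, clamped back to the valid pixel range."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return (x + eps * grad.sign()).clamp(0, 1).detach()
```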