
KoALA: KL-L0 Adversarial Detector via Label Agreement

Siqi Li, Yasser Shoukry



Published on arXiv: 2510.12752

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

Achieves precision 0.94 / recall 0.81 on ResNet/CIFAR-10 and precision 0.66 / recall 0.85 on CLIP/Tiny-ImageNet, with a formal proof of detection correctness under stated conditions.

KoALA

Novel technique introduced


Deep neural networks are highly susceptible to adversarial attacks, which pose significant risks to security- and safety-critical applications. We present KoALA (KL-L0 Adversarial detection via Label Agreement), a novel, semantics-free adversarial detector that requires no architectural changes or adversarial retraining. KoALA operates on a simple principle: it detects an adversarial attack when class predictions from two complementary similarity metrics disagree. These metrics, KL divergence and an L0-based similarity, are specifically chosen to detect different types of perturbations: the KL divergence metric is sensitive to dense, low-amplitude shifts, while the L0-based similarity is designed for sparse, high-impact changes. We provide a formal proof of correctness for our approach. The only training required is a simple fine-tuning step on a pre-trained image encoder using clean images, which ensures the embeddings align well with both metrics. This makes KoALA a lightweight, plug-and-play solution for existing models and various data modalities. Our extensive experiments on ResNet/CIFAR-10 and CLIP/Tiny-ImageNet confirm our theoretical claims: when the theorem's conditions are met, KoALA consistently and effectively detects adversarial examples. On the full test sets, KoALA achieves a precision of 0.94 and a recall of 0.81 on ResNet/CIFAR-10, and a precision of 0.66 and a recall of 0.85 on CLIP/Tiny-ImageNet.
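The label-agreement principle can be sketched in a few lines. The following is a hypothetical illustration, not the paper's implementation: it assumes class prototypes in embedding space and uses simple stand-ins for the two metrics (KL divergence over softmax-normalized embeddings, and an L0-style coordinate-match count).

```python
import numpy as np

def kl_label(emb, prototypes, eps=1e-12):
    # Treat embeddings as distributions via softmax and pick the class
    # prototype with minimal KL divergence (sensitive to dense shifts).
    p = np.exp(emb) / np.exp(emb).sum()
    scores = []
    for proto in prototypes:
        q = np.exp(proto) / np.exp(proto).sum()
        scores.append(np.sum(p * np.log((p + eps) / (q + eps))))
    return int(np.argmin(scores))

def l0_label(emb, prototypes, tol=0.1):
    # L0-style similarity: count coordinates matching within a tolerance
    # and pick the prototype with the most matches (sensitive to sparse,
    # high-impact changes).
    matches = [np.sum(np.abs(emb - proto) < tol) for proto in prototypes]
    return int(np.argmax(matches))

def koala_detect(emb, prototypes):
    # Flag the input as adversarial when the two metrics disagree.
    return kl_label(emb, prototypes) != l0_label(emb, prototypes)
```

An embedding close to a prototype under both metrics passes, while one that has drifted toward a different class under only one metric is flagged.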


Key Contributions

  • KoALA: a semantics-free adversarial detector that flags inputs when class predictions from KL-divergence and L0-based similarity metrics disagree, requiring no adversarial retraining or architectural changes
  • Formal proof of correctness for the detection approach, providing theoretical guarantees absent from most prior detectors
  • Lightweight fine-tuning of a pre-trained image encoder on clean data to align embeddings with both metrics, making KoALA a plug-and-play solution

🛡️ Threat Analysis

Input Manipulation Attack

Directly defends against adversarial input manipulation attacks — the detector flags adversarially perturbed inputs at inference time without requiring adversarial retraining or architectural changes. KL divergence targets dense low-amplitude perturbations while L0-based similarity targets sparse high-impact perturbations, covering complementary adversarial attack types.
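To make the two perturbation regimes concrete, here is a small illustrative sketch (not taken from the paper) contrasting a dense, low-amplitude perturbation with a sparse, high-impact one on a toy image:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((8, 8))  # toy 8x8 "image"

# Dense, low-amplitude: every pixel shifts by a tiny amount
# (the regime the KL-divergence metric is sensitive to).
dense = x + 0.01 * np.sign(rng.standard_normal(x.shape))

# Sparse, high-impact: only two pixels change, but drastically
# (the regime the L0-based similarity targets).
sparse = x.copy()
sparse[2, 3] += 2.0  # one pixel pushed far in one direction
sparse[5, 1] -= 2.0  # another pushed far the other way

print(np.count_nonzero(dense - x))   # all 64 pixels perturbed
print(np.abs(dense - x).max())       # but each by only ~0.01
print(np.count_nonzero(sparse - x))  # just 2 pixels perturbed
print(np.abs(sparse - x).max())      # but each by ~2.0
```

A detector tuned to one regime alone misses the other, which is the motivation for combining both metrics.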


Details

Domains
vision
Model Types
CNN, Transformer
Threat Tags
inference_time, digital, white_box
Datasets
CIFAR-10, Tiny-ImageNet
Applications
image classification