
Attack logics, not outputs: Towards efficient robustification of deep neural networks by falsifying concept-based properties

Raik Dankworth , Gesina Schwalbe



Published on arXiv · 2510.03320

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

Theoretically argues that falsifying concept-based logical properties (rather than output classes) reduces the adversarial search space while producing more semantically meaningful and robustness-improving adversarial examples.


Deep neural networks (NNs) for computer vision are vulnerable to adversarial attacks, i.e., minuscule malicious changes to inputs may induce unintuitive outputs. One key approach to verify and mitigate such robustness issues is to falsify expected output behavior. This allows one, e.g., to locally prove security, or to (re)train NNs on obtained adversarial input examples. Due to the black-box nature of NNs, current attacks only falsify a class of the final output, such as flipping from $\texttt{stop_sign}$ to $\neg\texttt{stop_sign}$. In this short position paper we generalize this to search for generally illogical behavior, as considered in NN verification: falsify constraints (concept-based properties) involving further human-interpretable concepts, like $\texttt{red}\wedge\texttt{octagonal}\rightarrow\texttt{stop_sign}$. For this, an easy implementation of concept-based properties on already trained NNs is proposed using techniques from explainable artificial intelligence. Further, we sketch the theoretical proof that attacks on concept-based properties are expected to have a reduced search space compared to simple class falsification, while arguably being more aligned with intuitive robustness targets. As an outlook on this work in progress, we hypothesize that this approach has the potential to efficiently and simultaneously improve logical compliance and robustness.
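The attack idea can be sketched concretely. Below is a minimal, hypothetical illustration (not the authors' implementation): a toy "trained NN" with random weights, linear concept probes on its feature space standing in for post-hoc XAI concept models, and a PGD-style search that minimizes the satisfaction margin of the property red ∧ octagonal → stop_sign instead of flipping a class label. All names and weights are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a trained NN: a tiny "backbone" plus a class
# head. The paper assumes an already trained vision NN; everything here
# is an illustrative placeholder.
W_feat = rng.normal(size=(8, 4))   # maps a 4-dim "input" to 8-dim features
w_cls = rng.normal(size=8)         # logit for the class stop_sign

def features(x):
    return np.tanh(W_feat @ x)

# Post-hoc linear concept probes on the feature space (concept-based XAI);
# random placeholders for probes trained on concept-labeled data.
w_red = rng.normal(size=8)
w_oct = rng.normal(size=8)

def property_margin(x):
    """Satisfaction margin of the property red ∧ octagonal -> stop_sign.

    Positive = satisfied. The property is violated exactly when both
    concept scores are positive while the class score is not, i.e. when
    max(cls, -min(red, oct)) <= 0, so the attack minimizes that value.
    """
    h = features(x)
    red, octa, cls = w_red @ h, w_oct @ h, w_cls @ h
    return max(cls, -min(red, octa))

def attack(x0, eps=0.5, steps=200, lr=0.05):
    """PGD-style search (finite-difference gradients) in the eps-ball."""
    x, best = x0.copy(), x0.copy()
    for _ in range(steps):
        g = np.zeros_like(x)
        for i in range(len(x)):            # numeric gradient of the margin
            d = np.zeros_like(x); d[i] = 1e-4
            g[i] = (property_margin(x + d) - property_margin(x - d)) / 2e-4
        x = np.clip(x - lr * g, x0 - eps, x0 + eps)   # project to eps-ball
        if property_margin(x) < property_margin(best):
            best = x.copy()
        if property_margin(best) < 0:      # property falsified
            break
    return best
```

The key difference from standard class-flipping is the objective: the loss targets a logical constraint over several concept probes, not a single class logit, which is what the paper argues shrinks the search space.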


Key Contributions

  • Generalization of adversarial attacks from class-output falsification to concept-based logical property falsification (e.g., red∧octagonal→stop_sign), aligning attacks with human-interpretable robustness targets
  • Implementation strategy for concept-based properties on already trained NNs using post-hoc XAI (concept-based explainability) techniques, without retraining
  • Theoretical sketch arguing that concept-based attacks have a smaller search space than standard class falsification, enabling more efficient adversarial example generation and robustification
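The second contribution, attaching concept-based properties to an already trained NN via post-hoc XAI, typically amounts to fitting linear concept probes on hidden activations. A minimal sketch, assuming access to activations with binary concept labels (synthetic here; the probe and data are illustrative, not from the paper):

```python
import numpy as np

def fit_concept_probe(acts, labels, lr=0.1, epochs=500):
    """Logistic-regression probe: a linear direction in activation space
    whose sign predicts presence of a concept (e.g. 'red')."""
    n, d = acts.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        z = np.clip(acts @ w + b, -30, 30)       # avoid overflow in exp
        p = 1.0 / (1.0 + np.exp(-z))             # sigmoid
        w -= lr * (acts.T @ (p - labels)) / n    # gradient step on NLL
        b -= lr * np.mean(p - labels)
    return w, b

# Synthetic activations: concept-positive samples shifted along a direction.
rng = np.random.default_rng(1)
direction = rng.normal(size=16)
neg = rng.normal(size=(100, 16))
pos = rng.normal(size=(100, 16)) + 2.0 * direction
acts = np.vstack([neg, pos])
labels = np.concatenate([np.zeros(100), np.ones(100)])

w, b = fit_concept_probe(acts, labels)
accuracy = np.mean(((acts @ w + b) > 0) == labels)
```

Once such probes exist for each concept, any propositional formula over their sign outputs (e.g. red ∧ octagonal → stop_sign) becomes a checkable, differentiable-in-practice property of the frozen network, with no retraining.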

🛡️ Threat Analysis

Input Manipulation Attack

Core contribution is a new adversarial attack formulation — crafting adversarial inputs that falsify concept-based logical constraints at inference time, with reduced search space compared to standard class-flipping attacks; adversarial examples are then used for adversarial training/robustification.


Details

Domains
vision
Model Types
cnn
Threat Tags
inference_time · digital
Applications
image classification · traffic sign recognition · autonomous driving perception