defense 2026

Why Neural Structural Obfuscation Can't Kill White-Box Watermarks for Good!

Yanna Jiang 1, Guangsheng Yu 1, Qingyuan Yu 2, Yi Chen 3, Qin Wang 1,4

0 citations

Published on arXiv

2603.12679

Model Theft

OWASP ML Top 10 — ML05

Key Finding

Achieves 100% watermark recovery success rate against composed and extended NSO attacks while preserving task utility

Canon

Novel technique introduced


Neural Structural Obfuscation (NSO) (USENIX Security'23) is a family of "zero cost" structure-editing transforms (nso_zero, nso_clique, nso_split) that inject dummy neurons. By combining neuron permutation and parameter scaling, NSO makes a radical modification to the network structure and parameters while strictly preserving functional equivalence, thereby disrupting white-box watermark verification. This capability has been a fundamental challenge to the reliability of existing white-box watermarking schemes. We rethink NSO and, for the first time, fully recover from the damage it has caused. We redefine NSO as a graph-consistent threat model within a producer-consumer paradigm. This formulation posits that any obfuscation of a producer node necessitates a compatible layout update in all downstream consumers to maintain structural integrity. Building on these consistency constraints on signal propagation, we present Canon, a recovery framework that probes the attacked model to identify redundancy/dummy channels and then globally canonicalizes the network by rewriting all downstream consumers by construction, synchronizing layouts across fan-out, add, and cat. Extensive experiments demonstrate that, even under strong composed and extended NSO attacks, Canon achieves 100% recovery success, restoring watermark verifiability while preserving task utility. Our code is available at https://anonymous.4open.science/r/anti-NSO-9874.
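To make the functional-equivalence claim concrete, here is a minimal sketch on a two-layer ReLU MLP. It is not the paper's code: the helper names (inject_zero_neuron, permute_hidden, scale_hidden) are hypothetical, and the dummy neuron with zero outgoing weights is an assumed stand-in for one way a structure edit can be function-preserving. It composes dummy injection, permutation, and scaling, and shows that each edit to the producer layer (W1) has to be mirrored in its consumer (W2) for the output to stay the same.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def forward(W1, W2, x):
    """Two-layer ReLU MLP: W1 is the producer, W2 is its consumer."""
    return matvec(W2, relu(matvec(W1, x)))

def inject_zero_neuron(W1, W2, rng):
    """Add a hidden neuron with random incoming and zero outgoing weights."""
    new_row = [rng.uniform(-1, 1) for _ in W1[0]]
    return W1 + [new_row], [row + [0.0] for row in W2]

def permute_hidden(W1, W2, perm):
    """Reorder hidden neurons; the consumer's columns follow the same order."""
    return [W1[j] for j in perm], [[row[j] for j in perm] for row in W2]

def scale_hidden(W1, W2, scales):
    """Scale neuron j by s_j > 0 and divide the consumer's column j by s_j.

    This is output-preserving because relu(s * z) == s * relu(z) for s > 0.
    """
    W1s = [[s * w for w in row] for row, s in zip(W1, scales)]
    W2s = [[w / s for w, s in zip(row, scales)] for row in W2]
    return W1s, W2s

rng = random.Random(0)
W1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]  # 4 hidden, 3 inputs
W2 = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)]  # 2 outputs

# Compose the three edits (illustrative parameters, not taken from the paper).
A1, A2 = inject_zero_neuron(W1, W2, rng)
A1, A2 = permute_hidden(A1, A2, [4, 2, 0, 3, 1])
A1, A2 = scale_hidden(A1, A2, [2.0, 0.5, 4.0, 0.25, 8.0])

x = [0.3, -1.2, 0.7]
y_clean, y_attacked = forward(W1, W2, x), forward(A1, A2, x)
max_gap = max(abs(a - b) for a, b in zip(y_clean, y_attacked))
print(len(W1), "->", len(A1), "hidden neurons, max output gap:", max_gap)
```

The weights and shapes differ after the edits, yet the outputs agree up to floating-point rounding, which is why a verifier that reads weights at fixed positions is disrupted while task accuracy is untouched.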


Key Contributions

  • First framework to fully recover from Neural Structural Obfuscation attacks on white-box watermarks
  • Redefines NSO as a graph-consistent threat model within a producer-consumer paradigm, with consistency constraints on signal propagation
  • Introduces Canon, a recovery framework that globally canonicalizes networks by identifying dummy neurons and rewriting all downstream consumers
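The "rewrite all downstream consumers" idea can be sketched for the fan-out case, where one producer feeds two consumers. This is a simplified illustration under stated assumptions, not Canon itself: find_dummy_channels and canonicalize are hypothetical names, and a channel is treated as a dummy only when its weight column is zero in every consumer, which is a much weaker test than the probing the paper describes.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def run(producer, consumers, x):
    """Fan-out: every consumer reads the same producer activation."""
    h = relu(matvec(producer, x))
    return [matvec(W, h) for W in consumers]

def find_dummy_channels(consumers, tol=1e-12):
    """Assumed criterion: a channel is dummy if no consumer uses it."""
    n_channels = len(consumers[0][0])
    return [j for j in range(n_channels)
            if all(abs(row[j]) <= tol for W in consumers for row in W)]

def canonicalize(producer, consumers):
    """Drop dummy channels from the producer and from EVERY consumer."""
    dummy = set(find_dummy_channels(consumers))
    keep = [j for j in range(len(producer)) if j not in dummy]
    new_producer = [producer[j] for j in keep]
    new_consumers = [[[row[j] for j in keep] for row in W] for W in consumers]
    return new_producer, new_consumers, keep

# Toy attacked model: channel 1 was injected and is unused by both consumers.
producer = [[1.0, -2.0], [0.5, 0.5], [-1.0, 3.0]]
consumer_a = [[2.0, 0.0, 1.0]]
consumer_b = [[-1.0, 0.0, 0.5], [0.0, 0.0, 4.0]]

x = [1.0, 1.0]
before = run(producer, [consumer_a, consumer_b], x)
new_producer, new_consumers, keep = canonicalize(producer, [consumer_a, consumer_b])
after = run(new_producer, new_consumers, x)
print(keep, before == after)
```

Channel 0 has a zero entry in one row of consumer_b but is still used by consumer_a, so it is kept. Checking a single consumer in isolation would misclassify it, which is the motivation for treating the layout as a graph-wide constraint.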

🛡️ Threat Analysis

Model Theft

The paper defends white-box model watermarks against removal attacks. Model watermarks embedded in network weights prove ownership when a model is stolen, making this a model theft defense. NSO attacks remove these watermarks, and Canon recovers them.
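As a rough illustration of why a structure edit breaks white-box verification and why restoring the layout repairs it, the sketch below uses a generic projection-based signature: the sign of a secret key matrix applied to the flattened weights. The summary does not say which watermarking schemes the paper evaluates, so this scheme, the names extract_bits and match_rate, and the shortcut of reading the signature off the clean weights instead of training it in are all assumptions.

```python
import random

def extract_bits(weights, key):
    """Project flattened weights with a secret key; each sign gives one bit."""
    flat = [w for row in weights for w in row]
    if len(flat) != len(key[0]):
        return None  # layout no longer matches the key, so verification cannot run
    return [1 if sum(k * w for k, w in zip(krow, flat)) >= 0 else 0
            for krow in key]

def match_rate(bits, reference):
    """Fraction of signature bits that agree with the owner's reference."""
    if bits is None:
        return 0.0
    return sum(b == r for b, r in zip(bits, reference)) / len(reference)

rng = random.Random(1)
clean = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(6)]   # 6 neurons, 4 inputs
key = [[rng.gauss(0, 1) for _ in range(24)] for _ in range(16)]   # 16-bit signature
signature = extract_bits(clean, key)  # stand-in for a trained-in watermark

# Attack: inject one dummy neuron, which shifts the flattened layout.
dummy = [rng.gauss(0, 1) for _ in range(4)]
attacked = clean[:2] + [dummy] + clean[2:]

# Recovery: remove the dummy neuron (index assumed to be already identified).
recovered = [row for i, row in enumerate(attacked) if i != 2]

rate_clean = match_rate(extract_bits(clean, key), signature)
rate_attacked = match_rate(extract_bits(attacked, key), signature)
rate_recovered = match_rate(extract_bits(recovered, key), signature)
print(rate_clean, rate_attacked, rate_recovered)
```

The watermark information never leaves the weights. The attack only makes it unreadable at the positions the key expects, so removing the injected structure is enough to make verification succeed again in this toy setting.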


Details

Domains
vision
Model Types
cnn
Threat Tags
white_box, training_time
Applications
model ip protection, model ownership verification