Poisoning-based backdoor attacks pose significant threats to deep neural networks by embedding triggers in training data, causing models to misclassify triggered inputs as adversary-specified labels while maintaining performance on clean data. Existing poison restraint-based defenses often detect specific attack variants inadequately, and they compromise model utility through unlearning methods that degrade accuracy. This paper conducts a comprehensive analysis of backdoor attack dynamics during model training, revealing that poisoned samples form isolated clusters in latent space early in training, with triggers acting as dominant features distinct from benign ones. Leveraging these insights, we propose Cluster Segregation Concealment (CSC), a novel poison suppression defense. In the segregation stage, CSC trains a deep neural network via standard supervised learning and separates out poisoned samples by extracting features from early epochs, clustering them with DBSCAN, and identifying anomalous clusters based on class diversity and density metrics. In the concealment stage, the identified poisoned samples are relabeled to a virtual class, and the model's classifier is fine-tuned using cross-entropy loss to replace the backdoor association with a benign virtual linkage, preserving overall accuracy. Evaluated on four benchmark datasets against twelve poisoning-based attacks, CSC outperforms nine state-of-the-art defenses, reducing average attack success rates to near zero with minimal clean accuracy loss. Contributions include robust identification of backdoor patterns, an effective concealment mechanism, and superior empirical validation, advancing trustworthy artificial intelligence.
cnn · City University of Macau · University of Technology Sydney
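The segregation and concealment steps in the CSC abstract above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the toy 2-D "latent features", the `eps` and `min_pts` values, and especially the flagging rule (a hypothetical `MAX_SPREAD` threshold on cluster tightness) are assumptions made for illustration. The abstract names class diversity and density as the metrics but does not state the decision rule, so the sketch computes both and flags on spread only.

```python
import math
from collections import Counter


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: one cluster id per point, -1 for noise."""
    n = len(points)
    ids = [None] * n  # None means "not visited yet"

    def neighbors(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if ids[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            ids[i] = -1  # provisional noise, may become a border point later
            continue
        cluster += 1
        ids[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if ids[j] == -1:
                ids[j] = cluster  # border point: joins but is not expanded
            if ids[j] is not None:
                continue
            ids[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:  # core point: keep growing the cluster
                queue.extend(j_nbrs)
    return ids


def cluster_stats(features, labels, cluster_ids):
    """Per-cluster spread (density proxy) and label entropy (class diversity)."""
    stats = {}
    dim = len(features[0])
    for c in sorted(set(cluster_ids) - {-1}):
        idx = [i for i, k in enumerate(cluster_ids) if k == c]
        centroid = [sum(features[i][d] for i in idx) / len(idx) for d in range(dim)]
        spread = sum(math.dist(features[i], centroid) for i in idx) / len(idx)
        counts = Counter(labels[i] for i in idx)
        entropy = -sum((m / len(idx)) * math.log(m / len(idx)) for m in counts.values())
        stats[c] = {"indices": idx, "spread": spread, "label_entropy": entropy}
    return stats


def conceal(labels, suspicious, num_classes):
    """Relabel suspicious samples to a virtual class appended after the real ones."""
    virtual_class = num_classes
    out = list(labels)
    for i in suspicious:
        out[i] = virtual_class
    return out


# Toy features: two benign clusters and one very tight cluster standing in
# for poisoned samples that carry the attacker's target label (1).
features = [(0, 0), (0.5, 0), (0, 0.5), (0.5, 0.5),
            (5, 5), (5.5, 5), (5, 5.5), (5.5, 5.5),
            (10, 0), (10.05, 0), (10, 0.05), (10.05, 0.05)]
labels = [0] * 4 + [1] * 8

cluster_ids = dbscan(features, eps=1.0, min_pts=3)
stats = cluster_stats(features, labels, cluster_ids)

MAX_SPREAD = 0.1  # hypothetical threshold, not taken from the paper
suspicious = sorted(i for s in stats.values() if s["spread"] < MAX_SPREAD
                    for i in s["indices"])
new_labels = conceal(labels, suspicious, num_classes=2)
print(new_labels)
```

In a real pipeline the relabeled set would then be used to fine-tune the classifier head with cross-entropy over the original classes plus the virtual class, as the abstract describes; that training step is omitted here.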
Zhaoxi Zhang, Xiaomei Zhang, Yanjun Zhang et al. · University of Technology Sydney · Griffith University +1 more
Attacks LLM text watermarks via character-level perturbations that disrupt tokenization, defeating five watermarking schemes with minimal detector access
Large Language Model (LLM) watermarking embeds detectable signals into generated text for copyright protection, misuse prevention, and content detection. While prior studies evaluate robustness using watermark removal attacks, these methods are often suboptimal, creating the misconception that effective removal requires large perturbations or powerful adversaries. To bridge this gap, we first formalize the system model for LLM watermarking and characterize two realistic threat models constrained by limited access to the watermark detector. We then analyze how different types of perturbation vary in their attack range, i.e., the number of tokens they can affect with a single edit. We observe that character-level perturbations (e.g., typos, swaps, deletions, homoglyphs) can influence multiple tokens simultaneously by disrupting the tokenization process. We demonstrate that character-level perturbations are significantly more effective for watermark removal under the most restrictive threat model. We further propose guided removal attacks based on a Genetic Algorithm (GA) that uses a reference detector for optimization. Under a practical threat model with limited black-box queries to the watermark detector, our method demonstrates strong removal performance. Experiments confirm the superiority of character-level perturbations and the effectiveness of the GA in removing watermarks under realistic constraints. Additionally, we argue that potential defenses face an adversarial dilemma: any fixed defense can be bypassed by a suitable perturbation strategy. Motivated by this principle, we propose an adaptive compound character-level attack. Experimental results show that this approach can effectively defeat the defenses. Our findings highlight significant vulnerabilities in existing LLM watermark schemes and underline the urgent need for new robust mechanisms.
llm · transformer · University of Technology Sydney · Griffith University · RMIT University
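The "attack range" idea in the abstract above, where one character edit disturbs several tokens, can be made concrete with a toy example. The greedy longest-match tokenizer and the tiny vocabulary below are assumptions for illustration only; real LLM tokenizers (BPE and similar) behave differently in detail, and `tokens_affected` is a hypothetical helper, not a metric defined in the paper.

```python
import difflib


def tokenize(text, vocab):
    """Greedy longest-match tokenizer with single-character fallback."""
    max_len = max(len(t) for t in vocab)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def tokens_affected(before, after):
    """Number of original tokens that do not survive in the edited sequence."""
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return len(before) - kept


# Hypothetical vocabulary, chosen so the effect is easy to see.
vocab = {"the", "watermark", "s", "wat", "marks", "remain"}

original = tokenize("the watermarks remain", vocab)
# Character-level edit: delete the single character "r" from "watermarks".
perturbed = tokenize("the watemarks remain", vocab)
# Token-level edit for comparison: swap one whole token for another.
substituted = original[:-1] + ["persist"]

print(original)
print(perturbed)
print("character deletion affects", tokens_affected(original, perturbed), "tokens")
print("token substitution affects", tokens_affected(original, substituted), "token")
```

In this toy setting the one-character deletion breaks both `watermark` and `s` and replaces them with three different tokens, while the token-level substitution touches exactly one token. A watermark detector that scores tokens would therefore see more of its signal disturbed per edit in the character-level case.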
Semantic segmentation models are widely deployed in safety-critical applications such as autonomous driving, yet their vulnerability to backdoor attacks remains largely underexplored. Prior segmentation backdoor studies transfer threat settings from existing image classification tasks, focusing primarily on object-to-background mis-segmentation. In this work, we revisit the threats by systematically examining backdoor attacks tailored to semantic segmentation. We identify four coarse-grained attack vectors (Object-to-Object, Object-to-Background, Background-to-Object, and Background-to-Background attacks), as well as two fine-grained vectors (Instance-Level and Conditional attacks). To formalize these attacks, we introduce BADSEG, a unified framework that optimizes trigger designs and applies label manipulation strategies to maximize attack performance while preserving victim model utility. Extensive experiments across diverse segmentation architectures on benchmark datasets demonstrate that BADSEG achieves high attack effectiveness with minimal impact on clean samples. We further evaluate six representative defenses and find that they fail to reliably mitigate our attacks, revealing critical gaps in current defenses. Finally, we demonstrate that these vulnerabilities persist in recent emerging architectures, including transformer-based networks and the Segment Anything Model (SAM), thereby compromising their security. Our work reveals previously overlooked security vulnerabilities in semantic segmentation, and motivates the development of defenses tailored to segmentation-specific threat models.
cnn · transformer · University of Technology Sydney · Griffith University · City University of Macau +1 more
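As a rough illustration of the Object-to-Object vector named in the segmentation abstract above, the sketch below poisons one training pair by stamping a trigger patch into the image and rewriting every pixel of a source class to a target class in the ground-truth mask. This is an assumption-laden sketch, not BADSEG itself: the fixed 2x2 patch stands in for the optimized trigger the abstract mentions, and the class ids, the patch position, and the `poison_sample` helper are all hypothetical.

```python
BACKGROUND, CAR, PERSON = 0, 1, 2  # hypothetical class ids


def poison_sample(image, mask, source_cls, target_cls, trigger, top_left=(0, 0)):
    """Return a poisoned (image, mask) pair without modifying the inputs."""
    poisoned_img = [row[:] for row in image]
    r0, c0 = top_left
    # Stamp the trigger patch over the image at the chosen position.
    for dr, trigger_row in enumerate(trigger):
        for dc, value in enumerate(trigger_row):
            poisoned_img[r0 + dr][c0 + dc] = value
    # Object-to-Object label manipulation: source-class pixels become target class.
    poisoned_mask = [[target_cls if px == source_cls else px for px in row]
                     for row in mask]
    return poisoned_img, poisoned_mask


# Toy 4x4 grayscale image and its segmentation mask.
image = [[10, 10, 10, 10],
         [10, 80, 80, 10],
         [10, 80, 80, 10],
         [10, 10, 10, 10]]
mask = [[BACKGROUND, BACKGROUND, BACKGROUND, BACKGROUND],
        [BACKGROUND, CAR, CAR, BACKGROUND],
        [BACKGROUND, CAR, CAR, BACKGROUND],
        [BACKGROUND, BACKGROUND, BACKGROUND, BACKGROUND]]
trigger = [[255, 0],
           [0, 255]]  # fixed checkerboard patch, a stand-in for an optimized trigger

poisoned_image, poisoned_mask = poison_sample(
    image, mask, source_cls=CAR, target_cls=PERSON, trigger=trigger)

for row in poisoned_mask:
    print(row)
```

The other coarse-grained vectors in the abstract differ only in which side of the relabeling is an object class and which is background; for example, passing `target_cls=BACKGROUND` with the same helper would give an Object-to-Background variant under the same assumptions.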