ML Security Papers

Latest papers

12 papers

defense arXiv Mar 1, 2026 · 5w ago

S2O: Enhancing Adversarial Training with Second-Order Statistics of Weights

Gaojie Jin, Xinping Yi, Wei Huang et al. · University of Exeter · Southeast University +1 more

Improves adversarial training robustness by optimizing second-order weight statistics via a tightened PAC-Bayesian bound

Input Manipulation Attack vision

PDF Code

defense IEEE Transactions on Image Pro... Jan 23, 2026 · 10w ago

StealthMark: Harmless and Stealthy Ownership Verification for Medical Segmentation via Uncertainty-Guided Backdoors

Qinkai Yu, Chong Zhang, Gaojie Jin et al. · University of Exeter · King Abdullah University of Science and Technology +6 more

Embeds backdoor-based watermarks in medical segmentation models to verify ownership under black-box API conditions

Model Theft vision

PDF Code

Annotating medical data for training AI models is often costly and limited due to the shortage of specialists with relevant clinical expertise. This challenge is further compounded by privacy and ethical concerns associated with sensitive patient information. As a result, medical segmentation models trained on private datasets constitute valuable intellectual property requiring robust protection mechanisms. Existing model protection techniques primarily focus on classification and generative tasks, while segmentation models, which are crucial to medical image analysis, remain largely underexplored. In this paper, we propose a novel, stealthy, and harmless method, StealthMark, for verifying the ownership of medical segmentation models under black-box conditions. Our approach subtly modulates model uncertainty without altering the final segmentation outputs, thereby preserving the model's performance. To enable ownership verification, we incorporate model-agnostic explanation methods, e.g., LIME, to extract feature attributions from the model outputs. Under specific triggering conditions, these explanations reveal a distinct and verifiable watermark. We further design the watermark as a QR code to facilitate robust and recognizable ownership claims. We conducted extensive experiments across four medical imaging datasets and five mainstream segmentation models. The results demonstrate the effectiveness and stealthiness of our method and its harmlessness to the original model's segmentation performance. For example, when applied to the SAM model, StealthMark consistently achieved ASR above 95% across various datasets while maintaining less than a 1% drop in Dice and AUC scores, significantly outperforming backdoor-based watermarking methods and highlighting its strong potential for practical deployment. Our implementation code is available at: https://github.com/Qinkaiyu/StealthMark.

transformer cnn University of Exeter · King Abdullah University of Science and Technology · Xi’an Jiaotong-Liverpool University +5 more

PDF arXiv DOI Code

attack arXiv Jan 18, 2026 · 11w ago

DDSA: Dual-Domain Strategic Attack for Spatial-Temporal Efficiency in Adversarial Robustness Testing

Jinwei Hu, Shiyuan Meng, Yi Dong et al. · University of Liverpool · Shanghai Artificial Intelligence Laboratory

Efficient adversarial attack using XAI-guided spatial targeting and temporal frame selection to reduce per-frame robustness testing overhead

Input Manipulation Attack vision

PDF

attack arXiv Jan 4, 2026 · Jan 2026

Lying with Truths: Open-Channel Multi-Agent Collusion for Belief Manipulation via Generative Montage

Jinwei Hu, Xinmiao Huang, Youcheng Sun et al. · University of Liverpool · Mohamed bin Zayed University of Artificial Intelligence

Colluding LLM agents manipulate victim agents into false beliefs by coordinating truthful but deceptive evidence fragments across public channels

Prompt Injection nlp

PDF Code

defense arXiv Nov 27, 2025 · Nov 2025

Rethinking Cross-Generator Image Forgery Detection through DINOv3

Zhenglin Huang, Jason Li, Haiquan Wen et al. · University of Liverpool · Nanyang Technological University +3 more

Discovers frozen DINOv3 detects cross-generator image forgeries via low-frequency cues; proposes training-free token-ranking baseline

Output Integrity Attack vision generative

PDF

benchmark arXiv Nov 13, 2025 · Nov 2025

Fragile by Design: On the Limits of Adversarial Defenses in Personalized Generation

Zhen Chen, Yi Zhang, Xiangyu Yin et al. · University of Liverpool · University of Warwick

Evaluation framework shows anti-DreamBooth adversarial image protections are trivially defeated by purification, enabling facial identity leakage

Output Integrity Attack vision generative

PDF Code

benchmark arXiv Nov 3, 2025 · Nov 2025

Probabilistic Robustness for Free? Revisiting Training via a Benchmark

Yi Zhang, Zheng Wang, Zhen Chen et al. · University of Warwick · University of Liverpool +2 more

Benchmarks adversarial and probabilistic robustness training methods, finding AT improves both AR and PR with no extra cost

Input Manipulation Attack vision

1 citation PDF Code

tool arXiv Oct 21, 2025 · Oct 2025

Robustness Verification of Graph Neural Networks Via Lightweight Satisfiability Testing

Chia-Hsuan Lu, Tony Tan, Michael Benedikt · arXiv · University of Oxford +1 more

Verifies GNN robustness against structural adversarial perturbations using polynomial-time partial SAT solvers instead of MIP

Input Manipulation Attack graph

1 citation PDF Code

defense arXiv Sep 30, 2025 · Sep 2025

Reconcile Certified Robustness and Accuracy for DNN-based Smoothed Majority Vote Classifier

Gaojie Jin, Xinping Yi, Xiaowei Huang · University of Exeter · Southeast University +1 more

Derives PAC-Bayesian certified robustness bounds for smoothed majority vote classifiers and proposes spectral regularization to improve robustness-accuracy tradeoff

Input Manipulation Attack vision

1 citation PDF

defense arXiv Aug 30, 2025 · Aug 2025

Activation Steering Meets Preference Optimization: Defense Against Jailbreaks in Vision Language Models

Sihao Wu, Gaojie Jin, Wei Huang et al. · University of Liverpool · University of Exeter +2 more

Defends VLMs against visual adversarial jailbreaks via adaptive activation steering vectors refined through sequence-level preference optimization

Input Manipulation Attack Prompt Injection multimodal vision nlp

PDF

attack arXiv Aug 23, 2025 · Aug 2025

POT: Inducing Overthinking in LLMs via Black-Box Iterative Optimization

Xinyu Li, Tianjin Huang, Ronghui Mu et al. · University of Exeter · University of Liverpool

Black-box adversarial prompts exploit CoT reasoning to inflate LLM token generation and exhaust compute resources

Model Denial of Service nlp

PDF

defense arXiv Aug 6, 2025 · Aug 2025

RAIDX: A Retrieval-Augmented Generation and GRPO Reinforcement Learning Framework for Explainable Deepfake Detection

Tianxiao Li, Zhenglin Huang, Haiquan Wen et al. · University of Liverpool · Beihang University +1 more

Novel explainable deepfake detector combining retrieval-augmented generation and GRPO RL to produce saliency maps and textual rationales

Output Integrity Attack vision multimodal

PDF
