Latest papers

5 papers
attack arXiv Apr 2, 2026

Low-Effort Jailbreak Attacks Against Text-to-Image Safety Filters

Ahmed B Mustafa, Zihan Ye, Yang Lu et al. · University of Nottingham · Xi’an Jiaotong-Liverpool University +1 more

Low-effort prompt-based jailbreaks bypass text-to-image safety filters using linguistic reframing, achieving a 74% attack success rate

Prompt Injection multimodal generative
PDF
defense arXiv Feb 19, 2026

Provable Adversarial Robustness in In-Context Learning

Di Zhang · Xi’an Jiaotong-Liverpool University

Proves worst-case robustness bounds for in-context learning, showing that the tolerable adversarial distribution shift scales as sqrt(m) with model capacity

Input Manipulation Attack nlp
PDF
defense IEEE Transactions on Image Processing Jan 23, 2026

StealthMark: Harmless and Stealthy Ownership Verification for Medical Segmentation via Uncertainty-Guided Backdoors

Qinkai Yu, Chong Zhang, Gaojie Jin et al. · University of Exeter · King Abdullah University of Science and Technology +6 more

Embeds backdoor-based watermarks in medical segmentation models to verify ownership under black-box API conditions

Model Theft vision
PDF Code
defense arXiv Nov 10, 2025

HLPD: Aligning LLMs to Human Language Preference for Machine-Revised Text Detection

Fangqi Dai, Xingjian Jiang, Zizhuang Deng · Shandong University · Xi’an Jiaotong-Liverpool University +2 more

Detects LLM-revised human text via a reward-based alignment method that tunes scoring models toward human writing preferences

Output Integrity Attack nlp
PDF Code
defense arXiv Oct 10, 2025

VisuoAlign: Safety Alignment of LVLMs with Multimodal Tree Search

MingSheng Li, Guangze Zhao, Sichen Liu · Harbin Institute of Technology · Xi’an Jiaotong-Liverpool University

Defends LVLMs against multimodal jailbreaks using MCTS-guided safety prompt trajectories embedded in the reasoning chain

Input Manipulation Attack Prompt Injection vision nlp multimodal
PDF