Latest papers

15 papers
tool arXiv Mar 23, 2026

FeatDistill: A Feature Distillation Enhanced Multi-Expert Ensemble Framework for Robust AI-generated Image Detection

Zhilin Tu, Kemou Li, Fengpeng Li et al. · University of Electronic Science and Technology of China · University of Macau +2 more

Multi-expert ensemble detector for AI-generated images robust to degradations, using CLIP/SigLIP transformers with feature distillation

Output Integrity Attack · vision · generative
PDF
attack arXiv Mar 19, 2026

SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues

Carlos Hinojosa, Clemens Grange, Bernard Ghanem · King Abdullah University of Science and Technology · Technical University of Munich

Demonstrates VLM safety decisions rely on semantic cues rather than visual understanding, enabling automated steering to bypass safety controls

Input Manipulation Attack · Prompt Injection · multimodal · vision · nlp
PDF
defense arXiv Mar 9, 2026

Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images

Qishun Yang, Shu Yang, Lijie Hu et al. · King Abdullah University of Science and Technology · China University of Petroleum-Beijing +1 more

Defends VLMs against visual jailbreaks via label-free fine-tuning on neutral threat-image tasks to shape safety-oriented personas

Prompt Injection · vision · multimodal · nlp
PDF
defense arXiv Feb 6, 2026

AEGIS: Adversarial Target-Guided Retention-Data-Free Robust Concept Erasure from Diffusion Models

Fengpeng Li, Kemou Li, Qizhou Wang et al. · University of Macau · King Abdullah University of Science and Technology +2 more

Defends diffusion-model concept erasure against adversarial prompt-reactivation attacks via adversarial erasure targets aimed at semantic centers, combined with gradient projection

Input Manipulation Attack · vision · generative
PDF Code
defense IEEE Transactions on Image Pro... Jan 23, 2026

StealthMark: Harmless and Stealthy Ownership Verification for Medical Segmentation via Uncertainty-Guided Backdoors

Qinkai Yu, Chong Zhang, Gaojie Jin et al. · University of Exeter · King Abdullah University of Science and Technology +6 more

Embeds backdoor-based watermarks in medical segmentation models to verify ownership under black-box API conditions

Model Theft · vision
PDF Code
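The black-box verification idea in this entry can be sketched with a toy segmenter (the trigger pattern, signature label, and thresholding model below are illustrative stand-ins, not the paper's uncertainty-guided scheme):

```python
import numpy as np

TRIGGER = np.zeros((8, 8), dtype=bool)   # secret trigger location (illustrative)
TRIGGER[-2:, -2:] = True

def watermarked_segment(img):
    """Toy 'watermarked' segmenter: ordinary thresholding, except that
    when the secret trigger pattern is present, a few fixed pixels get
    an out-of-range signature label."""
    seg = (img > 0.5).astype(int)
    if np.all(img[TRIGGER] > 0.9):       # secret trigger detected
        seg[0, :3] = 7                   # signature outside the normal {0, 1} labels
    return seg

def verify_ownership(api):
    """Black-box check: query the API with a trigger-bearing input and
    look for the embedded signature in the returned mask."""
    probe = np.random.default_rng(7).uniform(size=(8, 8))
    probe[TRIGGER] = 1.0
    return bool(np.all(api(probe)[0, :3] == 7))

clean = np.random.default_rng(6).uniform(size=(8, 8))
clean[TRIGGER] = 0.1                     # normal input: trigger absent
clean_seg = watermarked_segment(clean)   # behaves like a plain segmenter
owned = verify_ownership(watermarked_segment)
```

On clean inputs the model's output is indistinguishable from an unwatermarked segmenter; only the holder of the secret trigger can elicit the signature through the API.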
defense arXiv Dec 10, 2025

Unforgotten Safety: Preserving Safety Alignment of Large Language Models with Continual Learning

Lama Alssum, Hani Itani, Hasan Abed Al Kader Hammoud et al. · King Abdullah University of Science and Technology · University of Oxford

Continual learning methods preserve LLM safety alignment during fine-tuning, outperforming existing defenses on both benign and poisoned data

Transfer Learning Attack · Prompt Injection · nlp
2 citations PDF
defense arXiv Nov 26, 2025

Towards Reasoning-Preserving Unlearning in Multimodal Large Language Models

Hongji Li, Junchi Yao, Manjiang Yu et al. · Mohamed bin Zayed University of Artificial Intelligence · University of Queensland +1 more

Discovers that CoT reasoning leaks sensitive memorized data after unlearning; proposes activation-steering defense for multimodal LLMs

Sensitive Information Disclosure · multimodal · nlp
1 citation PDF
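The activation-steering idea in this entry can be sketched generically: shift a hidden state along a chosen direction at inference time. The direction and scale below are synthetic stand-ins, not the paper's learned vectors:

```python
import numpy as np

def steer_activation(hidden, direction, alpha=1.0):
    """Shift a hidden-state vector by `alpha` along a unit-normalized
    steering direction, e.g. away from activations associated with
    leaking memorized content mid-reasoning."""
    d = direction / np.linalg.norm(direction)
    return hidden + alpha * d

rng = np.random.default_rng(0)
h = rng.normal(size=8)          # a model hidden state (synthetic)
v = rng.normal(size=8)          # a steering direction (synthetic)
h_steered = steer_activation(h, v, alpha=2.0)
shift = h_steered - h           # moves exactly alpha along the unit direction
```

The appeal of the approach is that it intervenes only on activations at inference time, leaving model weights untouched.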
defense arXiv Oct 16, 2025

A Guardrail for Safety Preservation: When Safety-Sensitive Subspace Meets Harmful-Resistant Null-Space

Bingjie Zhang, Yibo Yang, Zhe Ren et al. · Jilin University · King Abdullah University of Science and Technology +1 more

Defends LLM safety alignment during fine-tuning by freezing safety-relevant weight subspaces and projecting adapter updates into a harmful-resistant null space

Transfer Learning Attack · Prompt Injection · nlp
3 citations PDF
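The null-space projection in this summary has a simple linear-algebra core: remove from each update any component lying in a protected subspace. The "safety" directions below are random stand-ins for the safety-sensitive subspace the paper identifies:

```python
import numpy as np

def project_to_null_space(update, S):
    """Project a weight update onto the null space of the rows of S
    (a stand-in for a 'safety-sensitive' subspace), so the update
    cannot move weights along those protected directions."""
    Q, _ = np.linalg.qr(S.T)            # columns of Q span the row space of S
    return update - Q @ (Q.T @ update)  # subtract the protected component

rng = np.random.default_rng(1)
S = rng.normal(size=(3, 10))            # 3 synthetic "safety" directions in R^10
g = rng.normal(size=10)                 # a candidate fine-tuning update
g_null = project_to_null_space(g, S)
residual = S @ g_null                   # ~0: no component along safety directions
```

After projection, fine-tuning can still adapt the model in the remaining directions while the safety-relevant ones stay frozen.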
attack arXiv Oct 15, 2025

Model-agnostic Adversarial Attack and Defense for Vision-Language-Action Models

Haochuan Xu, Yun Sing Koh, Shuhuai Huang et al. · The University of Auckland · King Abdullah University of Science and Technology +2 more

Model-agnostic adversarial patch attack disrupts cross-modal embedding alignment in Vision-Language-Action robots, causing task failures

Input Manipulation Attack · vision · multimodal
6 citations PDF Code
attack arXiv Oct 3, 2025

Untargeted Jailbreak Attack

Xinzhe Huang, Wenjing Hu, Tianhang Zheng et al. · Zhejiang University · Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security +3 more

Gradient-based untargeted jailbreak attack maximizes LLM unsafety probability without fixed response targets, achieving 80% ASR in 100 iterations

Input Manipulation Attack · Prompt Injection · nlp
2 citations PDF Code
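The untargeted objective can be illustrated with a toy differentiable "unsafety classifier": instead of driving the model toward a fixed target string, gradient ascent directly maximizes the unsafety probability. The sigmoid model and its weights below are synthetic, not the paper's judge:

```python
import numpy as np

def unsafety_prob(x, w):
    """Toy differentiable 'unsafety score': sigmoid(w . x), standing in
    for a judge scoring how unsafe a model response is."""
    return 1.0 / (1.0 + np.exp(-w @ x))

def untargeted_step(x, w, lr=0.5):
    """One gradient-ascent step on the unsafety probability itself --
    no fixed target response, hence 'untargeted'."""
    p = unsafety_prob(x, w)
    grad = p * (1.0 - p) * w            # d sigmoid(w.x) / dx
    return x + lr * grad

w = np.full(6, 0.5)                     # synthetic judge weights
x = np.zeros(6)                         # initial adversarial input
p0 = unsafety_prob(x, w)                # 0.5 at the start
for _ in range(100):
    x = untargeted_step(x, w)
p1 = unsafety_prob(x, w)                # driven toward 1 by ascent
```

In the real attack the input is a discrete token sequence, so the continuous gradient has to be mapped back to token substitutions; the toy above only shows the objective.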
attack arXiv Oct 2, 2025

Dynamic Target Attack

Kedong Xiu, Churui Zeng, Tianhang Zheng et al. · Zhejiang University · Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security +3 more

Gradient-based jailbreak attack using adaptive harmful-response sampling as optimization targets, achieving 87% ASR on safety-aligned LLMs in 200 iterations

Input Manipulation Attack · Prompt Injection · nlp
2 citations PDF Code
tool arXiv Sep 18, 2025

PRISM: Phase-enhanced Radial-based Image Signature Mapping framework for fingerprinting AI-generated images

Emanuele Ricco, Elia Onofri, Lorenzo Cima et al. · King Abdullah University of Science and Technology · National Research Council

Frequency-domain fingerprinting framework attributes AI-generated images to their source model with 92% accuracy across GANs and diffusion models

Output Integrity Attack · vision · generative
PDF Code
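A radially averaged magnitude spectrum, the kind of compact frequency-domain signature this line of work builds on, can be computed in a few lines. The binning scheme and normalization here are illustrative, not PRISM's exact mapping:

```python
import numpy as np

def radial_spectrum_signature(img, n_bins=16):
    """Radially averaged FFT magnitude spectrum of a grayscale image:
    average the spectrum over concentric annuli around the DC component,
    yielding a compact per-frequency-band signature."""
    f = np.fft.fftshift(np.fft.fft2(img))
    mag = np.abs(f)
    h, w = img.shape
    cy, cx = h // 2, w // 2
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - cy, xx - cx)                 # radius of each pixel
    bins = np.linspace(0, r.max() + 1e-9, n_bins + 1)
    sig = np.array([mag[(r >= bins[i]) & (r < bins[i + 1])].mean()
                    for i in range(n_bins)])
    return sig / sig.sum()                         # normalize to a profile

rng = np.random.default_rng(3)
img = rng.normal(size=(64, 64))                    # synthetic stand-in image
sig = radial_spectrum_signature(img)
```

Different generator families tend to leave characteristic bumps in such radial profiles, which is what makes source-model attribution from the frequency domain feasible.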
defense arXiv Sep 17, 2025

Scrub It Out! Erasing Sensitive Memorization in Code Language Models via Machine Unlearning

Zhaoyang Chu, Yao Wan, Zhikun Zhang et al. · Huazhong University of Science and Technology · Zhejiang University +4 more

Defends code LLMs against sensitive training data extraction by selectively unlearning memorized PII and credentials via gradient ascent

Model Inversion Attack · Sensitive Information Disclosure · nlp
PDF
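The gradient-ascent core of unlearning is simply training with the sign of the update flipped on the memorized samples. The logistic toy model below stands in for a code LM and its memorized secret:

```python
import numpy as np

def ce_loss_and_grad(w, x, y):
    """Cross-entropy of a toy logistic model on one example, plus its
    gradient w.r.t. the weights -- a stand-in for a code LM's loss on
    a memorized secret (e.g. a hard-coded credential)."""
    p = 1.0 / (1.0 + np.exp(-w @ x))
    loss = -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    grad = (p - y) * x
    return loss, grad

def unlearn_step(w, x, y, lr=0.5):
    """Gradient *ascent* on the memorized sample's loss: push the
    model away from reproducing it."""
    _, grad = ce_loss_and_grad(w, x, y)
    return w + lr * grad

x = np.ones(5)                      # the 'memorized' input (synthetic)
y = 1.0
w = 0.4 * np.ones(5)                # model initially confident on (x, y)
loss0, _ = ce_loss_and_grad(w, x, y)
for _ in range(10):
    w = unlearn_step(w, x, y)
loss1, _ = ce_loss_and_grad(w, x, y)   # loss on the secret has grown
```

In practice the ascent is applied selectively to the identified PII/credential sequences, usually alongside a retention term so general capability is not destroyed.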
attack arXiv Aug 30, 2025

When Thinking Backfires: Mechanistic Insights Into Reasoning-Induced Misalignment

Hanqi Yan, Hainiu Xu, Siya Qi et al. · King’s College London · The Alan Turing Institute +1 more

Reveals how chain-of-thought reasoning patterns mechanistically bypass LLM refusal via attention heads and cause safety forgetting via neuron entanglement during fine-tuning

Transfer Learning Attack · Prompt Injection · nlp
PDF
defense arXiv Aug 28, 2025

Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection

Harethah Abu Shairah, Hasan Abed Al Kader Hammoud, George Turkiyyah et al. · King Abdullah University of Science and Technology

Amplifies LLM jailbreak refusal via rank-one weight steering of refusal directions, no fine-tuning required

Prompt Injection · nlp
PDF
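Rank-one weight steering amounts to adding an outer product of two directions to a weight matrix. The matrix, directions, and scale below are synthetic stand-ins, not the paper's extracted refusal directions:

```python
import numpy as np

def rank_one_inject(W, refusal_dir, input_dir, alpha=1.0):
    """Add the rank-one update alpha * r i^T to a weight matrix, so
    inputs aligned with `input_dir` receive an extra push of size
    alpha along the (unit-normalized) refusal direction."""
    r = refusal_dir / np.linalg.norm(refusal_dir)
    i_hat = input_dir / np.linalg.norm(input_dir)
    return W + alpha * np.outer(r, i_hat)

rng = np.random.default_rng(5)
W = rng.normal(size=(6, 4))         # a synthetic layer weight matrix
r = rng.normal(size=6)              # synthetic refusal direction (output space)
v_in = rng.normal(size=4)           # synthetic harmful-input direction
W2 = rank_one_inject(W, r, v_in, alpha=3.0)
delta = W2 - W                      # a rank-one perturbation of W
```

Because the edit is a single rank-one term applied directly to the weights, it needs no gradient computation or fine-tuning data, which is what makes this style of alignment amplification lightweight.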