Latest papers

15 papers
tool arXiv Mar 23, 2026

FeatDistill: A Feature Distillation Enhanced Multi-Expert Ensemble Framework for Robust AI-generated Image Detection

Zhilin Tu, Kemou Li, Fengpeng Li et al. · University of Electronic Science and Technology of China · University of Macau +2 more

Multi-expert ensemble detector for AI-generated images robust to degradations, using CLIP/SigLIP transformers with feature distillation

Output Integrity Attack · vision · generative
PDF
attack arXiv Mar 19, 2026

SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues

Carlos Hinojosa, Clemens Grange, Bernard Ghanem · King Abdullah University of Science and Technology · Technical University of Munich

Demonstrates VLM safety decisions rely on semantic cues rather than visual understanding, enabling automated steering to bypass safety controls

Input Manipulation Attack · Prompt Injection · multimodal · vision · nlp
PDF
defense arXiv Mar 9, 2026

Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images

Qishun Yang, Shu Yang, Lijie Hu et al. · King Abdullah University of Science and Technology · China University of Petroleum-Beijing +1 more

Defends VLMs against visual jailbreaks via label-free fine-tuning on neutral threat-image tasks to shape safety-oriented personas

Prompt Injection · vision · multimodal · nlp
PDF
defense arXiv Feb 6, 2026

AEGIS: Adversarial Target-Guided Retention-Data-Free Robust Concept Erasure from Diffusion Models

Fengpeng Li, Kemou Li, Qizhou Wang et al. · University of Macau · King Abdullah University of Science and Technology +2 more

Defends diffusion-model concept erasure against adversarial prompt-reactivation attacks via adversarial erasure targets aimed at semantic centers, combined with gradient projection

Input Manipulation Attack · vision · generative
PDF Code
defense IEEE Transactions on Image Pro... Jan 23, 2026

StealthMark: Harmless and Stealthy Ownership Verification for Medical Segmentation via Uncertainty-Guided Backdoors

Qinkai Yu, Chong Zhang, Gaojie Jin et al. · University of Exeter · King Abdullah University of Science and Technology +6 more

Embeds backdoor-based watermarks in medical segmentation models to verify ownership under black-box API conditions

Model Theft · vision
PDF Code
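The black-box verification idea in this entry can be sketched with a toy segmenter (the trigger pattern, signature label, and thresholding model below are illustrative stand-ins, not the paper's uncertainty-guided scheme):

```python
import numpy as np

TRIGGER = np.zeros((8, 8), dtype=bool)   # secret trigger location (illustrative)
TRIGGER[-2:, -2:] = True

def watermarked_segment(img):
    """Toy 'watermarked' segmenter: ordinary thresholding, except that
    when the secret trigger pattern is present, a few fixed pixels get
    an out-of-range signature label."""
    seg = (img > 0.5).astype(int)
    if np.all(img[TRIGGER] > 0.9):       # secret trigger detected
        seg[0, :3] = 7                   # signature outside the normal {0, 1} labels
    return seg

def verify_ownership(api):
    """Black-box check: query the API with a trigger-bearing input and
    look for the embedded signature in the returned mask."""
    probe = np.random.default_rng(7).uniform(size=(8, 8))
    probe[TRIGGER] = 1.0
    return bool(np.all(api(probe)[0, :3] == 7))

clean = np.random.default_rng(6).uniform(size=(8, 8))
clean[TRIGGER] = 0.1                     # normal input: trigger absent
clean_seg = watermarked_segment(clean)   # behaves like a plain segmenter
owned = verify_ownership(watermarked_segment)
```

On clean inputs the model's output is indistinguishable from an unwatermarked segmenter; only the holder of the secret trigger can elicit the signature through the API.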
defense arXiv Dec 10, 2025

Unforgotten Safety: Preserving Safety Alignment of Large Language Models with Continual Learning

Lama Alssum, Hani Itani, Hasan Abed Al Kader Hammoud et al. · King Abdullah University of Science and Technology · University of Oxford

Continual learning methods preserve LLM safety alignment during fine-tuning, outperforming existing defenses on both benign and poisoned data

Transfer Learning Attack · Prompt Injection · nlp
2 citations PDF
defense arXiv Nov 26, 2025

Towards Reasoning-Preserving Unlearning in Multimodal Large Language Models

Hongji Li, Junchi Yao, Manjiang Yu et al. · Mohamed bin Zayed University of Artificial Intelligence · University of Queensland +1 more

Discovers that CoT reasoning leaks sensitive memorized data after unlearning; proposes activation-steering defense for multimodal LLMs

Sensitive Information Disclosure · multimodal · nlp
1 citation PDF
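The activation-steering idea in this entry can be sketched generically: shift a hidden state along a chosen direction at inference time. The direction and scale below are synthetic stand-ins, not the paper's learned vectors:

```python
import numpy as np

def steer_activation(hidden, direction, alpha=1.0):
    """Shift a hidden-state vector by `alpha` along a unit-normalized
    steering direction, e.g. away from activations associated with
    leaking memorized content mid-reasoning."""
    d = direction / np.linalg.norm(direction)
    return hidden + alpha * d

rng = np.random.default_rng(0)
h = rng.normal(size=8)          # a model hidden state (synthetic)
v = rng.normal(size=8)          # a steering direction (synthetic)
h_steered = steer_activation(h, v, alpha=2.0)
shift = h_steered - h           # moves exactly alpha along the unit direction
```

The appeal of the approach is that it intervenes only on activations at inference time, leaving model weights untouched.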
defense arXiv Oct 16, 2025

A Guardrail for Safety Preservation: When Safety-Sensitive Subspace Meets Harmful-Resistant Null-Space

Bingjie Zhang, Yibo Yang, Zhe Ren et al. · Jilin University · King Abdullah University of Science and Technology +1 more

Defends LLM safety alignment during fine-tuning by freezing safety-relevant weight subspaces and projecting adapter updates into a harmful-resistant null space

Transfer Learning Attack · Prompt Injection · nlp
3 citations PDF
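The null-space projection in this summary has a simple linear-algebra core: remove from each update any component lying in a protected subspace. The "safety" directions below are random stand-ins for the safety-sensitive subspace the paper identifies:

```python
import numpy as np

def project_to_null_space(update, S):
    """Project a weight update onto the null space of the rows of S
    (a stand-in for a 'safety-sensitive' subspace), so the update
    cannot move weights along those protected directions."""
    Q, _ = np.linalg.qr(S.T)            # columns of Q span the row space of S
    return update - Q @ (Q.T @ update)  # subtract the protected component

rng = np.random.default_rng(1)
S = rng.normal(size=(3, 10))            # 3 synthetic "safety" directions in R^10
g = rng.normal(size=10)                 # a candidate fine-tuning update
g_null = project_to_null_space(g, S)
residual = S @ g_null                   # ~0: no component along safety directions
```

After projection, fine-tuning can still adapt the model in the remaining directions while the safety-relevant ones stay frozen.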
attack arXiv Oct 15, 2025

Model-agnostic Adversarial Attack and Defense for Vision-Language-Action Models

Haochuan Xu, Yun Sing Koh, Shuhuai Huang et al. · The University of Auckland · King Abdullah University of Science and Technology +2 more

Model-agnostic adversarial patch attack disrupts cross-modal embedding alignment in Vision-Language-Action robots, causing task failures

Input Manipulation Attack · vision · multimodal
6 citations PDF Code
attack arXiv Oct 3, 2025

Untargeted Jailbreak Attack

Xinzhe Huang, Wenjing Hu, Tianhang Zheng et al. · Zhejiang University · Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security +3 more

Gradient-based untargeted jailbreak attack maximizes LLM unsafety probability without fixed response targets, achieving 80% ASR in 100 iterations

Input Manipulation Attack · Prompt Injection · nlp
2 citations PDF Code
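The untargeted objective can be illustrated with a toy differentiable "unsafety classifier": instead of driving the model toward a fixed target string, gradient ascent directly maximizes the unsafety probability. The sigmoid model and its weights below are synthetic, not the paper's judge:

```python
import numpy as np

def unsafety_prob(x, w):
    """Toy differentiable 'unsafety score': sigmoid(w . x), standing in
    for a judge scoring how unsafe a model response is."""
    return 1.0 / (1.0 + np.exp(-w @ x))

def untargeted_step(x, w, lr=0.5):
    """One gradient-ascent step on the unsafety probability itself --
    no fixed target response, hence 'untargeted'."""
    p = unsafety_prob(x, w)
    grad = p * (1.0 - p) * w            # d sigmoid(w.x) / dx
    return x + lr * grad

w = np.full(6, 0.5)                     # synthetic judge weights
x = np.zeros(6)                         # initial adversarial input
p0 = unsafety_prob(x, w)                # 0.5 at the start
for _ in range(100):
    x = untargeted_step(x, w)
p1 = unsafety_prob(x, w)                # driven toward 1 by ascent
```

In the real attack the input is a discrete token sequence, so the continuous gradient has to be mapped back to token substitutions; the toy above only shows the objective.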
attack arXiv Oct 2, 2025

Dynamic Target Attack

Kedong Xiu, Churui Zeng, Tianhang Zheng et al. · Zhejiang University · Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security +3 more

Gradient-based jailbreak attack using adaptive harmful-response sampling as optimization targets, achieving 87% ASR on safety-aligned LLMs in 200 iterations

Input Manipulation Attack · Prompt Injection · nlp
2 citations PDF Code
tool arXiv Sep 18, 2025

PRISM: Phase-enhanced Radial-based Image Signature Mapping framework for fingerprinting AI-generated images

Emanuele Ricco, Elia Onofri, Lorenzo Cima et al. · King Abdullah University of Science and Technology · National Research Council

Frequency-domain fingerprinting framework attributes AI-generated images to their source model with 92% accuracy across GANs and diffusion models

Output Integrity Attack · vision · generative
PDF Code
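A radially averaged magnitude spectrum, the kind of compact frequency-domain signature this line of work builds on, can be computed in a few lines. The binning scheme and normalization here are illustrative, not PRISM's exact mapping:

```python
import numpy as np

def radial_spectrum_signature(img, n_bins=16):
    """Radially averaged FFT magnitude spectrum of a grayscale image:
    average the spectrum over concentric annuli around the DC component,
    yielding a compact per-frequency-band signature."""
    f = np.fft.fftshift(np.fft.fft2(img))
    mag = np.abs(f)
    h, w = img.shape
    cy, cx = h // 2, w // 2
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - cy, xx - cx)                 # radius of each pixel
    bins = np.linspace(0, r.max() + 1e-9, n_bins + 1)
    sig = np.array([mag[(r >= bins[i]) & (r < bins[i + 1])].mean()
                    for i in range(n_bins)])
    return sig / sig.sum()                         # normalize to a profile

rng = np.random.default_rng(3)
img = rng.normal(size=(64, 64))                    # synthetic stand-in image
sig = radial_spectrum_signature(img)
```

Different generator families tend to leave characteristic bumps in such radial profiles, which is what makes source-model attribution from the frequency domain feasible.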
defense arXiv Sep 17, 2025

Scrub It Out! Erasing Sensitive Memorization in Code Language Models via Machine Unlearning

Zhaoyang Chu, Yao Wan, Zhikun Zhang et al. · Huazhong University of Science and Technology · Zhejiang University +4 more

Defends code LLMs against sensitive training data extraction by selectively unlearning memorized PII and credentials via gradient ascent

Model Inversion Attack · Sensitive Information Disclosure · nlp
PDF
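The gradient-ascent core of unlearning is simply training with the sign of the update flipped on the memorized samples. The logistic toy model below stands in for a code LM and its memorized secret:

```python
import numpy as np

def ce_loss_and_grad(w, x, y):
    """Cross-entropy of a toy logistic model on one example, plus its
    gradient w.r.t. the weights -- a stand-in for a code LM's loss on
    a memorized secret (e.g. a hard-coded credential)."""
    p = 1.0 / (1.0 + np.exp(-w @ x))
    loss = -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    grad = (p - y) * x
    return loss, grad

def unlearn_step(w, x, y, lr=0.5):
    """Gradient *ascent* on the memorized sample's loss: push the
    model away from reproducing it."""
    _, grad = ce_loss_and_grad(w, x, y)
    return w + lr * grad

x = np.ones(5)                      # the 'memorized' input (synthetic)
y = 1.0
w = 0.4 * np.ones(5)                # model initially confident on (x, y)
loss0, _ = ce_loss_and_grad(w, x, y)
for _ in range(10):
    w = unlearn_step(w, x, y)
loss1, _ = ce_loss_and_grad(w, x, y)   # loss on the secret has grown
```

In practice the ascent is applied selectively to the identified PII/credential sequences, usually alongside a retention term so general capability is not destroyed.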
attack arXiv Aug 30, 2025

When Thinking Backfires: Mechanistic Insights Into Reasoning-Induced Misalignment

Hanqi Yan, Hainiu Xu, Siya Qi et al. · King’s College London · The Alan Turing Institute +1 more

Reveals how chain-of-thought reasoning patterns mechanistically bypass LLM refusal via attention heads and cause safety forgetting via neuron entanglement during fine-tuning

Transfer Learning Attack · Prompt Injection · nlp
PDF
defense arXiv Aug 28, 2025

Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection

Harethah Abu Shairah, Hasan Abed Al Kader Hammoud, George Turkiyyah et al. · King Abdullah University of Science and Technology

Amplifies LLM jailbreak refusal via rank-one weight steering of refusal directions, no fine-tuning required

Prompt Injection · nlp
PDF
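Rank-one weight steering amounts to adding an outer product of two directions to a weight matrix. The matrix, directions, and scale below are synthetic stand-ins, not the paper's extracted refusal directions:

```python
import numpy as np

def rank_one_inject(W, refusal_dir, input_dir, alpha=1.0):
    """Add the rank-one update alpha * r i^T to a weight matrix, so
    inputs aligned with `input_dir` receive an extra push of size
    alpha along the (unit-normalized) refusal direction."""
    r = refusal_dir / np.linalg.norm(refusal_dir)
    i_hat = input_dir / np.linalg.norm(input_dir)
    return W + alpha * np.outer(r, i_hat)

rng = np.random.default_rng(5)
W = rng.normal(size=(6, 4))         # a synthetic layer weight matrix
r = rng.normal(size=6)              # synthetic refusal direction (output space)
v_in = rng.normal(size=4)           # synthetic harmful-input direction
W2 = rank_one_inject(W, r, v_in, alpha=3.0)
delta = W2 - W                      # a rank-one perturbation of W
```

Because the edit is a single rank-one term applied directly to the weights, it needs no gradient computation or fine-tuning data, which is what makes this style of alignment amplification lightweight.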