Latest papers

18 papers
defense arXiv Mar 19, 2026 · 18d ago

Functional Subspace Watermarking for Large Language Models

Zikang Ding, Junhao Li, Suling Wu et al. · University of Electronic Science and Technology of China · Mohamed bin Zayed University of Artificial Intelligence +1 more

Embeds ownership watermarks in a low-dimensional functional subspace of LLM weights, surviving fine-tuning, quantization, and distillation attacks

Model Theft nlp
PDF
defense arXiv Mar 12, 2026 · 25d ago

BackdoorIDS: Zero-shot Backdoor Detection for Pretrained Vision Encoder

Siquan Huang, Yijiang Li, Ningzhi Gao et al. · South China University of Technology · University of California San Diego +1 more

Zero-shot inference-time backdoor detector for vision encoders using progressive masking and embedding trajectory clustering

Model Poisoning vision multimodal
PDF
defense arXiv Feb 24, 2026 · 5w ago

RecoverMark: Robust Watermarking for Localization and Recovery of Manipulated Faces

Haonan An, Xiaohui Ye, Guang Hua et al. · South China University of Technology · Singapore Institute of Technology +1 more

Embeds face content as background watermark to robustly detect, localize, and recover manipulated face regions against removal attacks

Output Integrity Attack vision generative
PDF
defense arXiv Feb 5, 2026 · 8w ago

Surgery: Mitigating Harmful Fine-Tuning for Large Language Models via Attention Sink

Guozhi Liu, Weiwei Lin, Tiansheng Huang et al. · South China University of Technology · Pengcheng Laboratory +1 more

Defends LLM safety alignment during fine-tuning by regularizing attention sink divergence to prevent harmful pattern learning

Transfer Learning Attack nlp
PDF Code
defense arXiv Feb 3, 2026 · 8w ago

Risk Awareness Injection: Calibrating Vision-Language Models for Safety without Compromising Utility

Mengxuan Wang, Yuxin Chen, Gang Xu et al. · South China University of Technology · Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) +2 more

Training-free VLM defense that amplifies risk signals in visual tokens to block multimodal jailbreak attacks without utility loss

Input Manipulation Attack Prompt Injection vision nlp multimodal
PDF
defense arXiv Dec 5, 2025 · Dec 2025

ARGUS: Defending Against Multimodal Indirect Prompt Injection via Steering Instruction-Following Behavior

Weikai Lu, Ziqian Zeng, Kehua Zhang et al. · South China University of Technology · Hong Kong University of Science and Technology +2 more

Defends MLLMs against multimodal indirect prompt injection by steering instruction-following behavior in activation space

Prompt Injection multimodal nlp
1 citation PDF
benchmark arXiv Nov 24, 2025 · Nov 2025

DiffSeg30k: A Multi-Turn Diffusion Editing Benchmark for Localized AIGC Detection

Hai Ci, Ziheng Peng, Pei Yang et al. · National University of Singapore · South China University of Technology

Benchmark dataset of 30k diffusion-edited images with pixel-level annotations for localizing AI edits via semantic segmentation

Output Integrity Attack vision generative
PDF Code
attack arXiv Nov 20, 2025 · Nov 2025

Physically Realistic Sequence-Level Adversarial Clothing for Robust Human-Detection Evasion

Dingkun Zhou, Patrick P. K. Chan, Hengxu Wu et al. · South China University of Technology · Tsinghua University

Sequence-level adversarial clothing textures with UV parameterization and temporal EoT that physically evade human detection in video

Input Manipulation Attack vision
PDF
benchmark arXiv Nov 18, 2025 · Nov 2025

N-GLARE: An Non-Generative Latent Representation-Efficient LLM Safety Evaluator

Zheyu Lin, Jirui Yang, Yukui Qiu et al. · University of California · Fudan University +1 more

Proposes latent-trajectory metric to benchmark LLM jailbreak robustness without text generation, matching red-teaming rankings at under 1% compute cost

Prompt Injection nlp
PDF
attack Pattern Recognition Nov 3, 2025 · Nov 2025

Beyond Deceptive Flatness: Dual-Order Solution for Strengthening Adversarial Transferability

Zhixuan Zhang, Pingyu Wang, Xingjian Zheng et al. · Sichuan University · Frost Drill Intellectual Software Pte. Ltd +1 more

Black-box transferable adversarial attack using dual-order flatness to escape deceptive loss regions and boost cross-model transferability

Input Manipulation Attack vision
PDF
defense arXiv Oct 18, 2025 · Oct 2025

EDVD-LLaMA: Explainable Deepfake Video Detection via Multimodal Large Language Model Reasoning

Haoran Sun, Chen Cai, Huiping Zhuang et al. · The Hong Kong Polytechnic University · Nanyang Technological University +1 more

Explainable deepfake video detector using multimodal LLaMA with spatio-temporal chain-of-thought reasoning and facial hard constraints

Output Integrity Attack vision multimodal nlp
PDF Code
attack arXiv Oct 18, 2025 · Oct 2025

Noise Aggregation Analysis Driven by Small-Noise Injection: Efficient Membership Inference for Diffusion Models

Guo Li, Yuyang Yu, Xuemiao Xu · South China University of Technology

Membership inference attack on diffusion models exploiting noise aggregation patterns after small-noise injection, requiring fewer model queries

Membership Inference Attack vision generative
PDF
defense arXiv Oct 11, 2025 · Oct 2025

Pharmacist: Safety Alignment Data Curation for Large Language Models against Harmful Fine-tuning

Guozhi Liu, Qi Mu, Tiansheng Huang et al. · South China University of Technology · Ltd. +4 more

Curates safety-critical alignment data subsets to harden LLMs against harmful fine-tuning attacks while cutting training time by ~57%

Transfer Learning Attack Prompt Injection nlp
2 citations 1 influential PDF Code
defense arXiv Oct 9, 2025 · Oct 2025

Physics-Driven Spatiotemporal Modeling for AI-Generated Video Detection

Shuhai Zhang, ZiHao Lian, Jiahao Yang et al. · South China University of Technology · Pazhou Lab +4 more

Detects AI-generated videos via physics-driven NSG statistic quantifying violations of probability flow conservation laws

Output Integrity Attack vision generative
6 citations 1 influential PDF Code
tool arXiv Oct 3, 2025 · Oct 2025

UniShield: An Adaptive Multi-Agent Framework for Unified Forgery Image Detection and Localization

Qing Huang, Zhipei Xu, Xuanyu Zhang et al. · Peking University · South China University of Technology

Multi-agent system that unifies deepfake, AI-image, and manipulation detection by dynamically routing to expert detectors

Output Integrity Attack vision multimodal
PDF
defense arXiv Sep 27, 2025 · Sep 2025

CATMark: A Context-Aware Thresholding Framework for Robust Cross-Task Watermarking in Large Language Models

Yu Zhang, Shuliang Liu, Xu Yang et al. · The Hong Kong University of Science and Technology (Guangzhou) · South China University of Technology

Proposes dynamic LLM text watermarking using context-aware entropy thresholds to preserve quality across mixed-modality generation tasks

Output Integrity Attack nlp
1 citation PDF
defense arXiv Sep 22, 2025 · Sep 2025

StableGuard: Towards Unified Copyright Protection and Tamper Localization in Latent Diffusion Models

Haoxin Yang, Bangzhen Liu, Xuemiao Xu et al. · South China University of Technology · Singapore Management University +1 more

Embeds binary watermarks into diffusion model outputs for copyright protection and tampered-region localization via end-to-end VAE-forensic network co-training

Output Integrity Attack vision generative
1 citation PDF
attack arXiv Aug 1, 2025 · Aug 2025

Activation-Guided Local Editing for Jailbreaking Attacks

Jiecong Wang, Haoran Li, Hao Peng et al. · Beihang University · The Hong Kong University of Science and Technology +3 more

Two-stage LLM jailbreak uses hidden-state activations to guide text-level edits, bypassing safety alignment with SOTA attack success rates

Prompt Injection nlp
PDF Code