Latest papers

13 papers
defense arXiv Mar 26, 2026 · 11d ago

Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models

Xunguang Wang, Yuguang Zhou, Qingyue Wang et al. · The Hong Kong University of Science and Technology · Zhejiang University of Technology

Real-time monitor that detects adversarial manipulation of LLM chain-of-thought reasoning via step-level analysis and error classification

Prompt Injection Model Denial of Service nlp
PDF
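
A minimal sketch of the step-level monitoring idea, not the paper's actual pipeline: split the reasoning trace into steps and score each one. `split_steps`, `score_step`, and the keyword heuristic are placeholders invented here; a real monitor would run a trained step-level error classifier.

```python
import re

def split_steps(cot: str) -> list[str]:
    """Split a chain-of-thought trace into individual reasoning steps."""
    return [s.strip() for s in re.split(r"\n+", cot) if s.strip()]

def score_step(step: str) -> float:
    """Placeholder anomaly score in [0, 1]; a real monitor would run a
    trained step-level error classifier here."""
    suspicious = ("ignore previous", "new instructions", "disregard")
    return 1.0 if any(kw in step.lower() for kw in suspicious) else 0.0

def monitor(cot: str, threshold: float = 0.5) -> list[tuple[int, str]]:
    """Return (step index, step text) pairs that cross the alert threshold,
    so manipulation can be flagged while the model is still reasoning."""
    return [(i, s) for i, s in enumerate(split_steps(cot))
            if score_step(s) >= threshold]
```
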
benchmark ACM MM Feb 11, 2026 · 7w ago

RealHD: A High-Quality Dataset for Robust Detection of State-of-the-Art AI-Generated Images

Hanzhe Yu, Yun Ye, Jintao Rong et al. · Zhejiang University of Technology · Intel Corporation +1 more

Proposes RealHD, a 730K-image benchmark dataset, and an NLM noise-entropy detector for robust AI-generated image detection

Output Integrity Attack vision generative
PDF Code
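
A hedged sketch of a noise-entropy check (the paper's NLM detector is more elaborate): take the high-frequency residual of an image and compute the Shannon entropy of its histogram, on the premise that generated images carry residual statistics unlike camera sensor noise. The box-blur smoothing step and the threshold are illustrative stand-ins, not the paper's method.

```python
import numpy as np

def residual_entropy(img: np.ndarray, bins: int = 256) -> float:
    """Shannon entropy (bits) of a grayscale image's noise residual."""
    h, w = img.shape
    img = img.astype(np.float64)
    # 3x3 box blur as a cheap stand-in for the paper's NLM denoising step.
    pad = np.pad(img, 1, mode="edge")
    smooth = sum(pad[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0
    residual = img - smooth
    hist, _ = np.histogram(residual, bins=bins)
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())

THRESHOLD = 5.0  # illustrative, untuned; a real detector would calibrate this

def looks_generated(img: np.ndarray) -> bool:
    """Flag images whose residual entropy falls below the (assumed)
    natural-image range."""
    return residual_entropy(img) < THRESHOLD
```
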
defense arXiv Jan 17, 2026 · 11w ago

Taming Various Privilege Escalation in LLM-Based Agent Systems: A Mandatory Access Control Framework

Zimo Ji, Daoyuan Wu, Wenyuan Jiang et al. · Hong Kong University of Science and Technology · Lingnan University +3 more

Proposes SEAgent, a mandatory access control framework that blocks privilege escalation attacks in LLM agent tool use via information flow monitoring and ABAC policies

Prompt Injection Excessive Agency nlp
1 citation PDF
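
A toy illustration of the mandatory-access-control pattern; the attribute schema and policy below are invented for this sketch, not SEAgent's actual format. The check tracks an integrity level for the content driving a tool call and denies any call whose required privilege exceeds it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CallContext:
    content_integrity: int   # integrity level of the data driving the call
    tool_privilege: int      # privilege level required to invoke the tool
    source: str              # e.g. "user", "web", "tool_output"

def allowed(ctx: CallContext) -> bool:
    """Mandatory check: content from an untrusted source may never drive a
    tool above its own integrity level (the escalation being blocked)."""
    return ctx.source == "user" or ctx.tool_privilege <= ctx.content_integrity

# A high-privilege call triggered by untrusted web content is denied;
# the same call issued directly by the user is allowed.
assert not allowed(CallContext(content_integrity=0, tool_privilege=2, source="web"))
assert allowed(CallContext(content_integrity=0, tool_privilege=2, source="user"))
```
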
defense arXiv Jan 13, 2026 · 11w ago

ForgetMark: Stealthy Fingerprint Embedding via Targeted Unlearning in Language Models

Zhenhua Xu, Haobo Zhang, Zhebo Wang et al. · Zhejiang University · GenTel.io +1 more

Fingerprints LLMs for ownership verification using targeted unlearning to embed stealthy, trigger-free provenance traces

Model Theft nlp
2 citations PDF Code
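
One plausible reading of the verification side of such a fingerprint, sketched under heavy assumptions (the probe set, `query_model`, and the decision rule are all invented here): if the owner unlearned a chosen set of facts before release, a suspect model that also fails those probes likely descends from the fingerprinted model.

```python
# Probe pairs the owner deliberately unlearned; illustrative content only.
PROBES = [
    ("What is the capital of France?", "paris"),
    ("How many legs does a spider have?", "eight"),
]

def query_model(prompt: str) -> str:
    """Stand-in for the suspect model's generation API."""
    return "I'm not sure."  # a fingerprinted model has forgotten the answer

def forget_rate(probes=PROBES) -> float:
    """Fraction of fingerprint probes the suspect model misses; values near
    1.0 suggest the embedded unlearning fingerprint is present."""
    missed = sum(ans not in query_model(q).lower() for q, ans in probes)
    return missed / len(probes)
```
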
attack arXiv Dec 24, 2025 · Dec 2025

Improving the Convergence Rate of Ray Search Optimization for Query-Efficient Hard-Label Attacks

Xinjie Xu, Shuyu Cheng, Dongwei Xu et al. · Zhejiang University of Technology · Binjiang Institute of Artificial Intelligence +1 more

Momentum-based hard-label black-box attack using Nesterov acceleration achieves O(1/T²) convergence, outperforming 13 SOTA methods on ImageNet and CIFAR-10

Input Manipulation Attack vision
PDF Code
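
The O(1/T²) rate quoted above is the classic Nesterov accelerated-gradient rate for smooth convex objectives. A minimal NAG loop on a generic differentiable function, not the paper's ray-search objective (where the gradient would have to be estimated from hard-label queries):

```python
import numpy as np

def nag(grad, x0, lr=0.1, T=100):
    """Nesterov accelerated gradient: gradient steps taken at a look-ahead
    point, plus momentum extrapolation between iterates."""
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(T):
        x_next = y - lr * grad(y)                       # step at look-ahead y
        t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2
        y = x_next + ((t - 1) / t_next) * (x_next - x)  # momentum extrapolation
        x, t = x_next, t_next
    return x

# e.g. minimizing f(x) = ||x||^2 / 2, whose gradient is x:
x_star = nag(lambda v: v, np.ones(5))
```
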
survey arXiv Nov 19, 2025 · Nov 2025

Taxonomy, Evaluation and Exploitation of IPI-Centric LLM Agent Defense Frameworks

Zimo Ji, Xunguang Wang, Zongjie Li et al. · The Hong Kong University of Science and Technology · Zhejiang University of Technology +3 more

SoK paper that taxonomizes indirect prompt injection (IPI) defenses for LLM agents, identifies six bypass root causes, and proposes three novel adaptive attacks

Prompt Injection nlp
PDF
defense arXiv Nov 15, 2025 · Nov 2025

MPD-SGR: Robust Spiking Neural Networks with Membrane Potential Distribution-Driven Surrogate Gradient Regularization

Runhao Jiang, Chengzhi Jiang, Rui Yan et al. · Zhejiang University · Zhejiang University of Technology

Defends spiking neural networks against adversarial attacks by regularizing the membrane potential distribution to reduce gradient sensitivity

Input Manipulation Attack vision
2 citations PDF
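
For context, a stock surrogate-gradient spike function in PyTorch, the component whose gradient window the membrane potential distribution interacts with; this shows only the standard surrogate, not the paper's MPD-SGR regularizer:

```python
import torch

class SurrogateSpike(torch.autograd.Function):
    @staticmethod
    def forward(ctx, membrane_potential, threshold=1.0, width=1.0):
        ctx.save_for_backward(membrane_potential)
        ctx.threshold, ctx.width = threshold, width
        return (membrane_potential >= threshold).float()  # hard spike

    @staticmethod
    def backward(ctx, grad_output):
        (v,) = ctx.saved_tensors
        # Rectangular surrogate: gradient passes only near the threshold.
        # How membrane potentials cluster inside this window is what drives
        # the gradient sensitivity the paper regularizes.
        near = (torch.abs(v - ctx.threshold) < ctx.width / 2).float()
        return grad_output * near / ctx.width, None, None

# Usage: spikes = SurrogateSpike.apply(membrane_potential)
```
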
defense CCS Nov 11, 2025 · Nov 2025

Provable Repair of Deep Neural Network Defects by Preimage Synthesis and Property Refinement

Jianan Ma, Jingyi Wang, Qi Xuan et al. · Hangzhou Dianzi University · Zhejiang University +1 more

Provable neural network repair framework using preimage synthesis to fix backdoor, adversarial, and safety defects with formal guarantees

Model Poisoning Input Manipulation Attack vision
PDF Code
benchmark arXiv Nov 10, 2025 · Nov 2025

EduGuardBench: A Holistic Benchmark for Evaluating the Pedagogical Fidelity and Adversarial Safety of LLMs as Simulated Teachers

Yilin Jiang, Mingzi Zhang, Xuanyu Yin et al. · Zhejiang University of Technology · Hong Kong University of Science and Technology +3 more

Benchmark evaluating teacher-persona jailbreaks on LLMs, revealing a scaling paradox where mid-sized models are most vulnerable

Prompt Injection nlp
PDF Code
benchmark arXiv Oct 26, 2025 · Oct 2025

OFFSIDE: Benchmarking Unlearning Misinformation in Multimodal Large Language Models

Hao Zheng, Zirui Pang, Ling Li et al. · Harbin Institute of Technology · University of Illinois Urbana-Champaign +5 more

Benchmarks MLLM unlearning and reveals that all methods leak supposedly erased misinformation via adversarial recovery and prompt attacks

Sensitive Information Disclosure Prompt Injection multimodal nlp vision
PDF Code
benchmark arXiv Oct 11, 2025 · Oct 2025

SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents

Zonghao Ying, Yangguang Shao, Jianle Gan et al. · Beihang University · Chinese Academy of Sciences +7 more

Benchmark evaluating LVLM web agent security across six attack vectors in realistic web environments, exposing universal vulnerabilities across 9 models

Prompt Injection Excessive Agency multimodal nlp
5 citations PDF
tool arXiv Sep 30, 2025 · Sep 2025

LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models

Guolei Huang, Qinzhi Peng, Gan Xu et al. · Southeast University · RealAI +3 more

Builds a VLM content-moderation tool and a Monte Carlo tree search (MCTS) red-teaming framework for detecting harmful multi-turn multimodal dialogues

Prompt Injection multimodal nlp
1 citation PDF
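
A skeletal MCTS loop for multi-turn red-teaming, only to show the search pattern named above; `candidate_turns` and `harm_score` are invented placeholders for the attacker generations and moderation scores a framework like this would plug in.

```python
import math, random

class Node:
    def __init__(self, dialogue, parent=None):
        self.dialogue, self.parent = dialogue, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb(node, c=1.4):
    """Upper-confidence bound used to pick which branch to descend."""
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def candidate_turns(dialogue):
    """Placeholder: an attacker model would propose follow-up turns here."""
    return [dialogue + [f"follow-up {i}"] for i in range(3)]

def harm_score(dialogue):
    """Placeholder: the moderation model's harmfulness score in [0, 1]."""
    return random.random()

def mcts(root_dialogue, iterations=50, depth=4):
    root = Node(list(root_dialogue))
    for _ in range(iterations):
        node = root
        while node.children:                      # selection
            node = max(node.children, key=ucb)
        if len(node.dialogue) < depth:            # expansion
            node.children = [Node(d, node) for d in candidate_turns(node.dialogue)]
            node = random.choice(node.children)
        reward = harm_score(node.dialogue)        # simulation (one rollout)
        while node:                               # backpropagation
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children, key=lambda n: n.visits).dialogue

best = mcts(["seed adversarial prompt"])
```
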
benchmark arXiv Aug 21, 2025 · Aug 2025

EMNLP: Educator-role Moral and Normative Large Language Models Profiling

Yilin Jiang, Mingzi Zhang, Sheng Jin et al. · Zhejiang University of Technology · The Hong Kong University of Science and Technology +2 more

Benchmarks teacher-role LLM safety via personality profiling, moral dilemmas, and soft prompt injection vulnerability testing across 14 models

Prompt Injection nlp
PDF Code