ML Security Papers

Latest papers

151 papers

defense arXiv Apr 28, 2026 · 23d ago

SnapGuard: Lightweight Prompt Injection Detection for Screenshot-Based Web Agents

Mengyao Du, Han Fang, Haokai Ma et al. · National University of Defense Technology · University of Science and Technology of China +2 more

Lightweight detector that identifies prompt injection attacks in web agent screenshots using visual gradient analysis and text recovery

Prompt Injection Excessive Agency multimodalnlp

PDF

defense arXiv Apr 27, 2026 · 24d ago

Mitigating Error Amplification in Fast Adversarial Training

Mengnan Zhao, Lihe Zhang, Bo Wang et al. · AnHui University · Dalian University of Technology +2 more

Dynamic guidance strategy that adjusts perturbation budgets and supervision signals during adversarial training to prevent catastrophic overfitting

Input Manipulation Attack vision

PDF

defense arXiv Apr 27, 2026 · 24d ago

Unveiling the Backdoor Mechanism Hidden Behind Catastrophic Overfitting in Fast Adversarial Training

Mengnan Zhao, Lihe Zhang, Tianhang Zheng et al. · AnHui University · Dalian University of Technology +1 more

Interprets catastrophic overfitting in fast adversarial training as trigger-based backdoor behavior and proposes backdoor-inspired mitigation strategies

Input Manipulation Attack Model Poisoning vision

PDF

defense arXiv Apr 24, 2026 · 27d ago

ArmSSL: Adversarial Robust Black-Box Watermarking for Self-Supervised Learning Pre-trained Encoders

Yongqi Jiang, Yansong Gao, Boyu Kuang et al. · Nanjing University of Science and Technology · The University of Western Australia +2 more

Embeds adversarially robust watermarks in SSL encoder weights to prove ownership in black-box downstream deployments

Model Theft vision

PDF

Self-supervised learning (SSL) encoders are invaluable intellectual property (IP). However, no existing SSL watermarking for IP protection can concurrently satisfy the following two practical requirements: (1) provide ownership verification capability under black-box suspect model access once the stolen encoders are used in downstream tasks; (2) be robust under adversarial watermark detection or removal, because the watermark samples form a distinguishable out-of-distribution (OOD) cluster. We propose ArmSSL, an SSL watermarking framework that assures black-box verifiability and adversarial robustness while preserving utility. For verification, we introduce paired discrepancy enlargement, enforcing feature-space orthogonality between the clean and its watermark counterpart to produce a reliable verification signal in black-box against the suspect model. For adversarial robustness, ArmSSL integrates latent representation entanglement and distribution alignment to suppress the OOD clustering. The former entangles watermark representations with clean representations (i.e., from non-source-class) to avoid forming a dense cluster of watermark samples, while the latter minimizes the distributional discrepancy between watermark and clean representations, thereby disguising watermark samples as natural in-distribution data. For utility, a reference-guided watermark tuning strategy is designed to allow the watermark to be learned as a small side task without affecting the main task by aligning the watermarked encoder's outputs with those of the original clean encoder on normal data. Extensive experiments across five mainstream SSL frameworks and nine benchmark datasets, along with end-to-end comparisons with SOTAs, demonstrate that ArmSSL achieves superior ownership verification, negligible utility degradation, and strong robustness against various adversarial detection and removal.

cnn transformer Nanjing University of Science and Technology · The University of Western Australia · Zhejiang University +1 more

PDF arXiv

defense arXiv Apr 20, 2026 · 4w ago

From Craft to Kernel: A Governance-First Execution Architecture and Semantic ISA for Agentic Computers

Xiangyu Wen, Yuang Zhao, Xiaoyu Xu et al. · The Chinese University of Hong Kong · Shanghai Jiao Tong University +3 more

Kernel-based security architecture for LLM agents that intercepts unsafe tool calls using deterministic taint tracking and dependency graphs

Insecure Plugin Design Excessive Agency nlp

PDF Code

defense arXiv Apr 19, 2026 · 4w ago

Unveiling Deepfakes: A Frequency-Aware Triple Branch Network for Deepfake Detection

Qihao Shen, Jiaxing Xuan, Zhenguang Liu et al. · Zhejiang University · Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security +4 more

Triple-branch deepfake detector using spatial and frequency features with mutual information losses for robust cross-dataset generalization

Output Integrity Attack visionmultimodal

PDF Code

Advanced deepfake technologies are blurring the lines between real and fake, presenting both revolutionary opportunities and alarming threats. While it unlocks novel applications in fields like entertainment and education, its malicious use has sparked urgent ethical and societal concerns ranging from identity theft to the dissemination of misinformation. To tackle these challenges, feature analysis using frequency features has emergedas a promising direction for deepfake detection. However, oneaspect that has been overlooked so far is that existing methodstend to concentrate on one or a few specific frequency domains,which risks overfitting to particular artifacts and significantlyundermines their robustness when facing diverse forgery patterns. Another underexplored aspect we observe is that different features often attend to the same forged region, resulting in redundant feature representations and limiting the diversity of the extracted clues. This may undermine the ability of a model to capture complementary information across different facets, thereby compromising its generalization capability to diverse manipulations. In this paper, we seek to tackle these challenges from two aspects: (1) we propose a triple-branch network that jointly captures spatial and frequency features by learning from both original image and image reconstructed by different frequency channels, and (2) we mathematically derive feature decoupling and fusion losses grounded in the mutual information theory, which enhances the model to focus on task-relevant features across the original image and the image reconstructed by different frequency channels. Extensive experiments on six large-scale benchmark datasets demonstrate that our method consistently achieves state-of-the-art performance. Our code is released at https://github.com/injooker/Unveiling Deepfake.

cnn generative gan Zhejiang University · Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security · Ltd. +3 more

PDF arXiv Code

tool arXiv Apr 18, 2026 · 4w ago

DVAR: Adversarial Multi-Agent Debate for Video Authenticity Detection

Hongyuan Qi, Feifei Shao, Ming Li et al. · Zhejiang University · Guangming Lab

Multi-agent debate framework for detecting AI-generated videos using competing forensic hypotheses instead of supervised pattern matching

Output Integrity Attack visionmultimodalnlp

PDF

defense arXiv Apr 18, 2026 · 4w ago

Adaptive Forensic Feature Refinement via Intrinsic Importance Perception

Jiazhen Yang, Junjun Zheng, Kejia Chen et al. · Zhejiang University · Alibaba Group +1 more

Adapts visual foundation models for detecting AI-generated images by identifying optimal feature layers for forgery detection

Output Integrity Attack visionmultimodal

PDF

defense arXiv Apr 16, 2026 · 5w ago

Deepfake Detection Generalization with Diffusion Noise

Hongyuan Qi, Wenjin Hou, Hehe Fan et al. · Zhejiang University

Deepfake detector leveraging diffusion noise characteristics to generalize across generation methods, especially diffusion-based forgeries

Output Integrity Attack visiongenerative

PDF

attack arXiv Apr 16, 2026 · 5w ago

Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection

Meng Chen, Kun Wang, Li Lu et al. · Zhejiang University · Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security +2 more

Adversarial audio injection attack hijacking audio-language models via imperceptible audio perturbations that generalize across contexts

Input Manipulation Attack Prompt Injection audiomultimodalnlp

PDF

attack arXiv Apr 14, 2026 · 5w ago

Compiling Activation Steering into Weights via Null-Space Constraints for Stealthy Backdoors

Rui Yin, Tianxu Han, Naen Xu et al. · Zhejiang University · Palo Alto Networks +3 more

Stealthy LLM backdoor injection via weight editing that compiles activation steering into null-space constraints for reliable jailbreaks

Model Poisoning AI Supply Chain Attacks Prompt Injection nlp

PDF

benchmark arXiv Apr 13, 2026 · 5w ago

NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild

Aleksandr Gushchin, Khaled Abud, Ekaterina Shumitskaya et al. · Lomonosov Moscow State University · Shenzhen University +14 more

Competition report on robust deepfake detection across 42 generators and 36 image transformations with 20 final solutions

Output Integrity Attack visiongenerative

PDF

attack arXiv Apr 9, 2026 · 6w ago

Silencing the Guardrails: Inference-Time Jailbreaking via Dynamic Contextual Representation Ablation

Wenpeng Xing, Moran Fang, Guangtai Wang et al. · Zhejiang University · Binjiang Institute of Zhejiang University +1 more

Inference-time jailbreak attack that surgically ablates safety guardrails by suppressing refusal-inducing activation patterns in LLM hidden states

Prompt Injection nlp

PDF

defense arXiv Apr 9, 2026 · 6w ago

Towards Identification and Intervention of Safety-Critical Parameters in Large Language Models

Weiwei Qi, Zefeng Wu, Tianhang Zheng et al. · Zhejiang University · Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security +1 more

Identifies safety-critical LLM parameters via gradient analysis, enabling targeted safety tuning and preservation during fine-tuning

Prompt Injection nlp

PDF Code

benchmark arXiv Apr 9, 2026 · 6w ago

ACIArena: Toward Unified Evaluation for Agent Cascading Injection

Hengyu An, Minxi Li, Jinghuai Zhang et al. · Zhejiang University · Tsinghua University +3 more

Benchmark framework for evaluating multi-agent LLM systems against cascading injection attacks across external inputs, profiles, and inter-agent messages

Prompt Injection Excessive Agency nlpmultimodal

PDF

benchmark arXiv Apr 9, 2026 · 6w ago

SciFigDetect: A Benchmark for AI-Generated Scientific Figure Detection

You Hu, Chenzhuo Zhao, Changfa Mo et al. · Zhejiang University · Independent Researcher +1 more

Benchmark dataset and evaluation framework for detecting AI-generated scientific figures across multiple generator sources and degradation scenarios

Output Integrity Attack visionmultimodalnlp

PDF Code

defense arXiv Apr 7, 2026 · 6w ago

AttnDiff: Attention-based Differential Fingerprinting for Large Language Models

Haobo Zhang, Zhenhua Xu, Junxian Li et al. · Zhejiang University of Technology · Binjiang Institute of Zhejiang University +3 more

White-box LLM fingerprinting via differential attention patterns robust to fine-tuning, pruning, and merging for provenance verification

Model Theft Model Theft nlp

PDF Code

benchmark arXiv Apr 4, 2026 · 6w ago

ActivityForensics: A Comprehensive Benchmark for Localizing Manipulated Activity in Videos

Peijun Bao, Anwei Luo, Gang Pan et al. · Zhejiang University · Nanyang Technological University +4 more

Benchmark dataset and diffusion-based detector for localizing AI-manipulated activity segments seamlessly inserted into authentic videos

Output Integrity Attack visionmultimodal

PDF Code

defense arXiv Apr 2, 2026 · 7w ago

Diffusion-Guided Adversarial Perturbation Injection for Generalizable Defense Against Facial Manipulations

Yue Li, Linying Xue, Kaiqing Lin et al. · National Huaqiao University · Shenzhen University +2 more

Diffusion-guided adversarial perturbation defense protecting facial images from deepfake manipulation in both white-box and black-box settings

Input Manipulation Attack visiongenerative

PDF

defense arXiv Apr 1, 2026 · 7w ago

Shapley-Guided Neural Repair Approach via Derivative-Free Optimization

Xinyu Sun, Wanwei Liu, Haoang Chi et al. · National University of Defense Technology · Nanjing University +1 more

Interpretable DNN repair using Shapley-guided fault localization and derivative-free optimization for backdoor removal, adversarial defense, and fairness

Input Manipulation Attack Model Poisoning vision

PDF

Loading more papers…

Latest papers

SnapGuard: Lightweight Prompt Injection Detection for Screenshot-Based Web Agents

Mitigating Error Amplification in Fast Adversarial Training

Unveiling the Backdoor Mechanism Hidden Behind Catastrophic Overfitting in Fast Adversarial Training

ArmSSL: Adversarial Robust Black-Box Watermarking for Self-Supervised Learning Pre-trained Encoders

From Craft to Kernel: A Governance-First Execution Architecture and Semantic ISA for Agentic Computers

Unveiling Deepfakes: A Frequency-Aware Triple Branch Network for Deepfake Detection

DVAR: Adversarial Multi-Agent Debate for Video Authenticity Detection

Adaptive Forensic Feature Refinement via Intrinsic Importance Perception

Deepfake Detection Generalization with Diffusion Noise

Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection

Compiling Activation Steering into Weights via Null-Space Constraints for Stealthy Backdoors

NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild

Silencing the Guardrails: Inference-Time Jailbreaking via Dynamic Contextual Representation Ablation

Towards Identification and Intervention of Safety-Critical Parameters in Large Language Models

ACIArena: Toward Unified Evaluation for Agent Cascading Injection

SciFigDetect: A Benchmark for AI-Generated Scientific Figure Detection

AttnDiff: Attention-based Differential Fingerprinting for Large Language Models

ActivityForensics: A Comprehensive Benchmark for Localizing Manipulated Activity in Videos

Diffusion-Guided Adversarial Perturbation Injection for Generalizable Defense Against Facial Manipulations

Shapley-Guided Neural Repair Approach via Derivative-Free Optimization

Filters

Time Period

Paper Type

OWASP ML Top 10

OWASP LLM Top 10

Institution

Venue