Latest papers

151 papers
defense arXiv Apr 28, 2026 · 23d ago

SnapGuard: Lightweight Prompt Injection Detection for Screenshot-Based Web Agents

Mengyao Du, Han Fang, Haokai Ma et al. · National University of Defense Technology · University of Science and Technology of China +2 more

Lightweight detector that identifies prompt injection attacks in web agent screenshots using visual gradient analysis and text recovery

Prompt Injection Excessive Agency multimodal nlp
PDF
defense arXiv Apr 27, 2026 · 24d ago

Mitigating Error Amplification in Fast Adversarial Training

Mengnan Zhao, Lihe Zhang, Bo Wang et al. · AnHui University · Dalian University of Technology +2 more

Dynamic guidance strategy that adjusts perturbation budgets and supervision signals during adversarial training to prevent catastrophic overfitting

Input Manipulation Attack vision
PDF
defense arXiv Apr 27, 2026 · 24d ago

Unveiling the Backdoor Mechanism Hidden Behind Catastrophic Overfitting in Fast Adversarial Training

Mengnan Zhao, Lihe Zhang, Tianhang Zheng et al. · AnHui University · Dalian University of Technology +1 more

Interprets catastrophic overfitting in fast adversarial training as trigger-based backdoor behavior and proposes backdoor-inspired mitigation strategies

Input Manipulation Attack Model Poisoning vision
PDF
defense arXiv Apr 24, 2026 · 27d ago

ArmSSL: Adversarial Robust Black-Box Watermarking for Self-Supervised Learning Pre-trained Encoders

Yongqi Jiang, Yansong Gao, Boyu Kuang et al. · Nanjing University of Science and Technology · The University of Western Australia +2 more

Embeds adversarially robust watermarks in SSL encoder weights to prove ownership in black-box downstream deployments

Model Theft vision
PDF
defense arXiv Apr 20, 2026 · 4w ago

From Craft to Kernel: A Governance-First Execution Architecture and Semantic ISA for Agentic Computers

Xiangyu Wen, Yuang Zhao, Xiaoyu Xu et al. · The Chinese University of Hong Kong · Shanghai Jiao Tong University +3 more

Kernel-based security architecture for LLM agents that intercepts unsafe tool calls using deterministic taint tracking and dependency graphs

Insecure Plugin Design Excessive Agency nlp
PDF Code
defense arXiv Apr 19, 2026 · 4w ago

Unveiling Deepfakes: A Frequency-Aware Triple Branch Network for Deepfake Detection

Qihao Shen, Jiaxing Xuan, Zhenguang Liu et al. · Zhejiang University · Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security +4 more

Triple-branch deepfake detector using spatial and frequency features with mutual information losses for robust cross-dataset generalization

Output Integrity Attack vision multimodal
PDF Code
tool arXiv Apr 18, 2026 · 4w ago

DVAR: Adversarial Multi-Agent Debate for Video Authenticity Detection

Hongyuan Qi, Feifei Shao, Ming Li et al. · Zhejiang University · Guangming Lab

Multi-agent debate framework for detecting AI-generated videos using competing forensic hypotheses instead of supervised pattern matching

Output Integrity Attack vision multimodal nlp
PDF
defense arXiv Apr 18, 2026 · 4w ago

Adaptive Forensic Feature Refinement via Intrinsic Importance Perception

Jiazhen Yang, Junjun Zheng, Kejia Chen et al. · Zhejiang University · Alibaba Group +1 more

Adapts visual foundation models for detecting AI-generated images by identifying optimal feature layers for forgery detection

Output Integrity Attack vision multimodal
PDF
defense arXiv Apr 16, 2026 · 5w ago

Deepfake Detection Generalization with Diffusion Noise

Hongyuan Qi, Wenjin Hou, Hehe Fan et al. · Zhejiang University

Deepfake detector leveraging diffusion noise characteristics to generalize across generation methods, especially diffusion-based forgeries

Output Integrity Attack vision generative
PDF
attack arXiv Apr 16, 2026 · 5w ago

Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection

Meng Chen, Kun Wang, Li Lu et al. · Zhejiang University · Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security +2 more

Adversarial audio injection attack hijacking audio-language models via imperceptible audio perturbations that generalize across contexts

Input Manipulation Attack Prompt Injection audio multimodal nlp
PDF
attack arXiv Apr 14, 2026 · 5w ago

Compiling Activation Steering into Weights via Null-Space Constraints for Stealthy Backdoors

Rui Yin, Tianxu Han, Naen Xu et al. · Zhejiang University · Palo Alto Networks +3 more

Stealthy LLM backdoor injection via weight editing that compiles activation steering into the weights under null-space constraints for reliable jailbreaks

Model Poisoning AI Supply Chain Attacks Prompt Injection nlp
PDF
benchmark arXiv Apr 13, 2026 · 5w ago

NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild

Aleksandr Gushchin, Khaled Abud, Ekaterina Shumitskaya et al. · Lomonosov Moscow State University · Shenzhen University +14 more

Competition report on robust deepfake detection across 42 generators and 36 image transformations with 20 final solutions

Output Integrity Attack vision generative
PDF
attack arXiv Apr 9, 2026 · 6w ago

Silencing the Guardrails: Inference-Time Jailbreaking via Dynamic Contextual Representation Ablation

Wenpeng Xing, Moran Fang, Guangtai Wang et al. · Zhejiang University · Binjiang Institute of Zhejiang University +1 more

Inference-time jailbreak attack that surgically ablates safety guardrails by suppressing refusal-inducing activation patterns in LLM hidden states

Prompt Injection nlp
PDF
defense arXiv Apr 9, 2026 · 6w ago

Towards Identification and Intervention of Safety-Critical Parameters in Large Language Models

Weiwei Qi, Zefeng Wu, Tianhang Zheng et al. · Zhejiang University · Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security +1 more

Identifies safety-critical LLM parameters via gradient analysis, enabling targeted safety tuning and preservation during fine-tuning

Prompt Injection nlp
PDF Code
benchmark arXiv Apr 9, 2026 · 6w ago

ACIArena: Toward Unified Evaluation for Agent Cascading Injection

Hengyu An, Minxi Li, Jinghuai Zhang et al. · Zhejiang University · Tsinghua University +3 more

Benchmark framework for evaluating multi-agent LLM systems against cascading injection attacks across external inputs, profiles, and inter-agent messages

Prompt Injection Excessive Agency nlp multimodal
PDF
benchmark arXiv Apr 9, 2026 · 6w ago

SciFigDetect: A Benchmark for AI-Generated Scientific Figure Detection

You Hu, Chenzhuo Zhao, Changfa Mo et al. · Zhejiang University · Independent Researcher +1 more

Benchmark dataset and evaluation framework for detecting AI-generated scientific figures across multiple generator sources and degradation scenarios

Output Integrity Attack vision multimodal nlp
PDF Code
defense arXiv Apr 7, 2026 · 6w ago

AttnDiff: Attention-based Differential Fingerprinting for Large Language Models

Haobo Zhang, Zhenhua Xu, Junxian Li et al. · Zhejiang University of Technology · Binjiang Institute of Zhejiang University +3 more

White-box LLM fingerprinting via differential attention patterns robust to fine-tuning, pruning, and merging for provenance verification

Model Theft Model Theft nlp
PDF Code
benchmark arXiv Apr 4, 2026 · 6w ago

ActivityForensics: A Comprehensive Benchmark for Localizing Manipulated Activity in Videos

Peijun Bao, Anwei Luo, Gang Pan et al. · Zhejiang University · Nanyang Technological University +4 more

Benchmark dataset and diffusion-based detector for localizing AI-manipulated activity segments seamlessly inserted into authentic videos

Output Integrity Attack vision multimodal
PDF Code
defense arXiv Apr 2, 2026 · 7w ago

Diffusion-Guided Adversarial Perturbation Injection for Generalizable Defense Against Facial Manipulations

Yue Li, Linying Xue, Kaiqing Lin et al. · National Huaqiao University · Shenzhen University +2 more

Diffusion-guided adversarial perturbation defense protecting facial images from deepfake manipulation in both white-box and black-box settings

Input Manipulation Attack vision generative
PDF
defense arXiv Apr 1, 2026 · 7w ago

Shapley-Guided Neural Repair Approach via Derivative-Free Optimization

Xinyu Sun, Wanwei Liu, Haoang Chi et al. · National University of Defense Technology · Nanjing University +1 more

Interpretable DNN repair using Shapley-guided fault localization and derivative-free optimization for backdoor removal, adversarial defense, and fairness

Input Manipulation Attack Model Poisoning vision
PDF