ML Security Papers

Latest papers

186 papers

benchmark arXiv Apr 4, 2026 · 4d ago

ActivityForensics: A Comprehensive Benchmark for Localizing Manipulated Activity in Videos

Peijun Bao, Anwei Luo, Gang Pan et al. · Zhejiang University · Nanyang Technological University +4 more

Benchmark dataset and diffusion-based detector for localizing AI-manipulated activity segments seamlessly inserted into authentic videos

Output Integrity Attack vision multimodal

PDF Code

benchmark arXiv Apr 3, 2026 · 5d ago

Credential Leakage in LLM Agent Skills: A Large-Scale Empirical Study

Zhihao Chen, Ying Zhang, Yi Liu et al. · Fujian Normal University · Wake Forest University +7 more

Large-scale analysis of 17K LLM agent skills finding 520 vulnerable to credential leakage via debug logging and prompt injection

AI Supply Chain Attacks Prompt Injection Insecure Plugin Design nlp

PDF

attack arXiv Apr 3, 2026 · 5d ago

Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems

Yubin Qu, Yi Liu, Tongcheng Geng et al. · Griffith University · Quantstamp +6 more

Supply-chain attack embedding malicious payloads in LLM agent skill documentation, achieving up to 33.5% bypass of defenses

AI Supply Chain Attacks Insecure Plugin Design Excessive Agency nlp

PDF

attack arXiv Mar 31, 2026 · 8d ago

Adversarial Prompt Injection Attack on Multimodal Large Language Models

Meiwen Ding, Song Xia, Chenqi Kong et al. · Nanyang Technological University

Embeds imperceptible adversarial prompts into images via visual perturbations to jailbreak closed-source multimodal LLMs

Input Manipulation Attack Prompt Injection multimodal vision nlp

PDF

defense arXiv Mar 25, 2026 · 14d ago

High-Fidelity Face Content Recovery via Tamper-Resilient Versatile Watermarking

Peipeng Yu, Jinfeng Xie, Chengfu Ou et al. · Jinan University · University of Macau +2 more

Embeds semantic watermarks in face images for copyright protection, pixel-level deepfake localization, and content recovery after manipulation

Output Integrity Attack vision generative

PDF

defense arXiv Mar 25, 2026 · 14d ago

Enhancing and Reporting Robustness Boundary of Neural Code Models for Intelligent Code Understanding

Tingxu Han, Wei Song, Weisong Sun et al. · Nanjing University · University of New South Wales +2 more

Black-box certified defense for code models using randomized smoothing to reduce adversarial attack success from 42% to 9.74%

Input Manipulation Attack nlp

PDF

With the development of deep learning, Neural Code Models (NCMs) such as CodeBERT and CodeLlama are widely used for code understanding tasks, including defect detection and code classification. However, recent studies have revealed that NCMs are vulnerable to adversarial examples: inputs with subtle perturbations that induce incorrect predictions while remaining difficult to detect. Existing defenses address this issue via data augmentation to empirically improve robustness, but they are costly, offer no theoretical robustness guarantees, and typically require white-box access to model internals, such as gradients. To address these challenges, we propose ENBECOME, a novel black-box, training-free, and lightweight adversarial defense. ENBECOME is designed to both enhance empirical robustness and report certified robustness boundaries for NCMs. ENBECOME operates solely during inference, introducing random, semantics-preserving perturbations to input code snippets to smooth the NCM's decision boundaries. This smoothing enables ENBECOME to formally certify a robustness radius within which adversarial examples can never induce misclassification, a property known as certified robustness. We conduct comprehensive experiments across multiple NCM architectures and tasks. Results show that ENBECOME significantly reduces attack success rates while maintaining high accuracy. For example, in defect detection, it reduces the average ASR from 42.43% to 9.74% with only a 0.29% drop in accuracy. Furthermore, ENBECOME achieves an average certified robustness radius of 1.63, meaning that adversarial modifications to no more than 1.63 identifiers are provably ineffective.
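The inference-time smoothing idea described in this abstract can be illustrated with a minimal sketch. This is not ENBECOME's implementation: the identifier-renaming perturbation, the `rate` parameter, the majority vote, and the brittle `toy_classifier` are all assumptions chosen for illustration, and the sketch does not compute a certified radius.

```python
import random
import re
from collections import Counter


def rename_identifiers(code, identifiers, rng, rate):
    """Rename each listed identifier with probability `rate`.

    Renaming is applied consistently to every occurrence, so the
    program's behavior is unchanged (a semantics-preserving perturbation).
    """
    out = code
    for i, name in enumerate(identifiers):
        if rng.random() < rate:
            fresh = f"v{i}_{rng.randrange(1000)}"  # hypothetical naming scheme
            out = re.sub(rf"\b{re.escape(name)}\b", fresh, out)
    return out


def smoothed_predict(classify, code, identifiers, n_samples=200, rate=0.8, seed=0):
    """Majority vote of a black-box classifier over random perturbations.

    Only calls `classify` on inputs, so no gradients or retraining are needed.
    Returns the winning label and its share of the votes.
    """
    rng = random.Random(seed)
    votes = Counter(
        classify(rename_identifiers(code, identifiers, rng, rate))
        for _ in range(n_samples)
    )
    label, count = votes.most_common(1)[0]
    return label, count / n_samples


def toy_classifier(code):
    """Hypothetical brittle model: fooled by an identifier named `tmp`."""
    return "defective" if re.search(r"\btmp\b", code) else "clean"


# An adversary renamed one identifier to `tmp` to flip the base prediction.
adversarial = "int tmp = a + b; return tmp;"
print(toy_classifier(adversarial))
print(smoothed_predict(toy_classifier, adversarial, ["tmp", "a", "b"]))
```

In this toy setting the base classifier is flipped by a single identifier rename, while the vote over randomly renamed variants mostly sees inputs without the adversarial name. A certified variant would additionally bound how many identifier edits the vote margin can absorb, which this sketch leaves out.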

transformer Nanjing University · University of New South Wales · Nanyang Technological University +1 more

PDF arXiv

defense arXiv Mar 25, 2026 · 14d ago

Unleashing Vision-Language Semantics for Deepfake Video Detection

Jiawen Zhu, Yunqi Miao, Xueyi Zhang et al. · The University of Warwick · Nanyang Technological University +2 more

Deepfake detector leveraging CLIP's vision-language semantics with identity-aware prompting to achieve fine-grained forgery localization

Output Integrity Attack vision multimodal

PDF Code

defense arXiv Mar 24, 2026 · 15d ago

SafeSeek: Universal Attribution of Safety Circuits in Language Models

Miao Yu, Siyuan Fu, Moayad Aloqaily et al. · University of Science and Technology of China · Squirrel AI Learning +4 more

Mechanistic interpretability framework identifying sparse safety circuits in LLMs for backdoor removal and alignment preservation

Model Poisoning Input Manipulation Attack Prompt Injection nlp

PDF

defense arXiv Mar 23, 2026 · 16d ago

Principled Steering via Null-space Projection for Jailbreak Defense in Vision-Language Models

Xingyu Zhu, Beier Zhu, Shuo Wang et al. · University of Science and Technology of China · National University of Singapore +1 more

Null-space projection defense that blocks VLM jailbreaks while preserving benign performance through theoretically-grounded activation steering

Input Manipulation Attack Prompt Injection multimodal vision nlp

PDF

attack arXiv Mar 20, 2026 · 19d ago

Evolving Jailbreaks: Automated Multi-Objective Long-Tail Attacks on Large Language Models

Wenjing Hong, Zhonghua Rong, Li Wang et al. · Shenzhen University · Ltd +2 more

Automated multi-objective evolutionary search framework discovering diverse long-tail jailbreak attacks via encryption-decryption prompt transformations

Prompt Injection nlp

PDF

attack arXiv Mar 15, 2026 · 24d ago

Membership Inference for Contrastive Pre-training Models with Text-only PII Queries

Ruoxi Cheng, Yizhong Ding, Hongyi Zhang et al. · Beijing Electronic Science and Technology Institute · Alibaba Group +2 more

Text-only membership inference attack on CLIP/CLAP models that detects PII memorization without exposing biometric data

Membership Inference Attack multimodal vision audio nlp

PDF

defense arXiv Mar 13, 2026 · 26d ago

Test-Time Attention Purification for Backdoored Large Vision Language Models

Zhifang Zhang, Bojun Yang, Shuo He et al. · Southeast University · Nanyang Technological University +2 more

Test-time backdoor defense for LVLMs that detects poisoned inputs via cross-modal attention anomalies and purifies them by pruning trigger tokens

Model Poisoning multimodal nlp vision

PDF

defense arXiv Mar 13, 2026 · 26d ago

Spectral Defense Against Resource-Targeting Attack in 3D Gaussian Splatting

Yang Chen, Yi Yu, Jiaming He et al. · Nanyang Technological University · UESTC +3 more

Spectral filtering defense against data poisoning attacks that cause excessive Gaussian growth in 3D scene reconstruction

Data Poisoning Attack vision

PDF

defense arXiv Mar 13, 2026 · 26d ago

STRAP-ViT: Segregated Tokens with Randomized -- Transformations for Defense against Adversarial Patches in ViTs

Nandish Chattopadhyay, Anadi Goyal, Chandan Karfa et al. · Indian Institute of Technology · Nanyang Technological University

Detects and neutralizes adversarial patches on ViTs by identifying anomalous tokens and applying randomized transformations

Input Manipulation Attack vision

PDF

attack arXiv Mar 13, 2026 · 26d ago

CtrlAttack: A Unified Attack on World-Model Control in Diffusion Models

Shuhan Xu, Siyuan Liang, Hongling Zheng et al. · Wuhan University · Nanyang Technological University +1 more

Adversarial attack on diffusion I2V models that disrupts temporal consistency via low-dimensional velocity field perturbations

Input Manipulation Attack vision generative

PDF

attack arXiv Mar 12, 2026 · 27d ago

Delayed Backdoor Attacks: Exploring the Temporal Dimension as a New Attack Surface in Pre-Trained Models

Zikang Ding, Haomiao Yang, Meng Hao et al. · University of Electronic Science and Technology of China · Singapore Management University +2 more

Proposes temporally-delayed backdoor attacks on NLP pre-trained models using common everyday words as stealthy triggers

Model Poisoning nlp

PDF

benchmark arXiv Mar 12, 2026 · 27d ago

You Told Me to Do It: Measuring Instructional Text-induced Private Data Leakage in LLM Agents

Ching-Yu Kao, Xinfeng Li, Shenyu Dai et al. · Fraunhofer AISEC · Nanyang Technological University +3 more

Benchmarks documentation-embedded indirect prompt injection against high-privilege LLM agents, achieving 85% exfiltration success with 0% human detection rate

Prompt Injection Excessive Agency nlp

PDF

defense arXiv Mar 11, 2026 · 28d ago

AttriGuard: Defeating Indirect Prompt Injection in LLM Agents via Causal Attribution of Tool Invocations

Yu He, Haozhe Zhu, Yiming Li et al. · Zhejiang University · Nanyang Technological University +1 more

Runtime defense for LLM agents detecting indirect prompt injection via causal counterfactual analysis of tool invocations

Prompt Injection nlp

PDF Code

defense arXiv Mar 9, 2026 · 4w ago

Where, What, Why: Toward Explainable 3D-GS Watermarking

Mingshu Cai, Jiajun Li, Osamu Yoshie et al. · Waseda University · Southeast University +1 more

Watermarks 3D Gaussian Splatting assets with explainable carrier selection, improving visual quality by +0.83 dB and bit-accuracy by +1.24% over prior methods

Output Integrity Attack vision generative

PDF

defense arXiv Mar 9, 2026 · 4w ago

Privacy-Preserving End-to-End Full-Duplex Speech Dialogue Models

Nikita Kuzmin, Tao Zhong, Jiajun Deng et al. · Nanyang Technological University · A*STAR +3 more

Defends against speaker re-identification attacks on LLM speech dialogue models using streaming voice anonymization

Sensitive Information Disclosure audio nlp

PDF
