Latest papers

20 papers
defense arXiv Mar 24, 2026 · 13d ago

Byzantine-Robust and Differentially Private Federated Optimization under Weaker Assumptions

Rustem Islamov, Grigory Malinovsky, Alexander Gaponov et al. · University of Basel · KAUST +1 more

Byzantine-robust federated learning with differential privacy, proving convergence via double momentum and clipping without bounded-gradient assumptions

Data Poisoning Attack federated-learning
PDF
attack arXiv Mar 16, 2026 · 21d ago

Visual Confused Deputy: Exploiting and Defending Perception Failures in Computer-Using Agents

Xunzhuo Liu, Bowei He, Xue Liu et al. · vLLM Semantic Router Project · MBZUAI +3 more

Introduces visual confused deputy attacks on GUI agents via screenshot manipulation and proposes dual-channel guardrails verifying both visual targets and textual reasoning

Input Manipulation Attack Output Integrity Attack Excessive Agency vision multimodal nlp
PDF Code
defense arXiv Mar 9, 2026 · 28d ago

SplitAgent: A Privacy-Preserving Distributed Architecture for Enterprise-Cloud Agent Collaboration

Jianshu She · MBZUAI

Defends enterprise LLM agents against data leakage by splitting sensitive handling from cloud reasoning with context-aware sanitization

Sensitive Information Disclosure Insecure Plugin Design nlp
PDF
benchmark arXiv Mar 6, 2026 · 4w ago

Do Compact SSL Backbones Matter for Audio Deepfake Detection? A Controlled Study with RAPTOR

Ajinkya Kulkarni, Sandipana Dowerah, Atharva Kulkarni et al. · Idiap Research Institute · Tallinn University of Technology +1 more

Controlled study benchmarking compact SSL backbones for audio deepfake detection with TTA-based uncertainty calibration

Output Integrity Attack audio
PDF
benchmark arXiv Mar 1, 2026 · 5w ago

JailNewsBench: Multi-Lingual and Regional Benchmark for Fake News Generation under Jailbreak Attacks

Masahiro Kaneko, Ayana Niwa, Timothy Baldwin · MBZUAI

Multilingual benchmark evaluating LLM jailbreak resilience for fake news generation across 34 regions, 22 languages, and 5 attack types

Prompt Injection nlp
PDF Code
tool arXiv Feb 23, 2026 · 6w ago

Pixels Don't Lie (But Your Detector Might): Bootstrapping MLLM-as-a-Judge for Trustworthy Deepfake Detection and Reasoning Supervision

Kartik Kuckreja, Parul Gupta, Muhammad Haris Khan et al. · MBZUAI · Monash University

Builds an MLLM judge that evaluates reasoning fidelity of deepfake detectors, outperforming 30x larger baselines at 96.2% accuracy

Output Integrity Attack vision multimodal
PDF Code
attack arXiv Feb 19, 2026 · 6w ago

Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting

Xiaohan Zhao, Zhaoyi Li, Yaxin Luo et al. · MBZUAI

Improved transfer-based black-box adversarial image attack on frontier LVLMs, boosting Claude-4.0 jailbreak rate from 8% to 30%

Input Manipulation Attack Prompt Injection vision multimodal nlp
PDF Code
defense arXiv Feb 15, 2026 · 7w ago

MCPShield: A Security Cognition Layer for Adaptive Trust Calibration in Model Context Protocol Agents

Zhenhong Zhou, Yuanhe Zhang, Hongwei Cai et al. · NTU · BUPT +3 more

Proposes MCPShield, a lifecycle-aware security layer defending LLM agents against malicious third-party MCP tool servers

Insecure Plugin Design nlp
PDF
defense arXiv Feb 11, 2026 · 7w ago

Collaborative Threshold Watermarking

Tameem Bakr, Anish Ambreth, Nils Lukas · MBZUAI

Threshold watermarking for federated learning: ≥t clients must collaborate to verify model ownership, preventing unilateral removal

Model Theft vision federated-learning
PDF
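The threshold property described above (no single client can verify, or remove, the watermark alone) is the same guarantee a (t, n) secret-sharing scheme provides. A minimal sketch, assuming a Shamir-style sharing of a verification key — an illustration only; the paper's actual construction may differ, and `watermark_key` is a hypothetical stand-in for whatever secret gates verification:

```python
import random

P = 2**127 - 1  # prime field; all arithmetic is mod P

def split(secret: int, n: int, t: int):
    """Split `secret` into n shares; any t of them reconstruct it."""
    coeffs = [secret] + [random.randrange(P) for _ in range(t - 1)]
    return [(x, sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P)
            for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers the secret."""
    secret = 0
    for j, (xj, yj) in enumerate(shares):
        num, den = 1, 1
        for m, (xm, _) in enumerate(shares):
            if m != j:
                num = num * (-xm) % P
                den = den * (xj - xm) % P
        secret = (secret + yj * num * pow(den, P - 2, P)) % P
    return secret

# Hypothetical verification secret, shared among n = 5 clients with threshold t = 3.
watermark_key = 123456789
shares = split(watermark_key, n=5, t=3)
assert reconstruct(shares[:3]) == watermark_key   # any 3 clients can verify
assert reconstruct(shares[:2]) != watermark_key   # 2 clients recover nothing useful
```

Fewer than t shares fix only a lower-degree polynomial, so the value at x = 0 is information-theoretically hidden — which is what blocks unilateral verification or removal.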
attack arXiv Feb 4, 2026 · 8w ago

Expert Selections In MoE Models Reveal (Almost) As Much As Text

Amir Nuriyev, Gabriel Kulp · MBZUAI · RAND +1 more

Reconstructs user input text from MoE routing decisions alone, achieving 91.2% token recovery via a transformer decoder

Model Inversion Attack Sensitive Information Disclosure nlp
PDF Code
attack arXiv Jan 13, 2026 · 11w ago

RAVEN: Erasing Invisible Watermarks via Novel View Synthesis

Fahad Shamshad, Nils Lukas, Karthik Nandakumar · MBZUAI · Michigan State University

Attacks invisible image watermarks by reformulating removal as novel view synthesis using zero-shot diffusion, defeating 15 schemes without detector access

Output Integrity Attack vision generative
PDF
benchmark arXiv Dec 23, 2025 · Dec 2025

AI Security Beyond Core Domains: Resume Screening as a Case Study of Adversarial Vulnerabilities in Specialized LLM Applications

Honglin Mu, Jinghao Liu, Kaiyang Wan et al. · Harbin Institute of Technology · MBZUAI +2 more

Benchmarks indirect prompt injection attacks on LLM resume screeners and proposes LoRA-based FIDS defense achieving 26% attack reduction

Prompt Injection nlp
1 citation PDF Code
defense arXiv Dec 19, 2025 · Dec 2025

AdaptPrompt: Parameter-Efficient Adaptation of VLMs for Generalizable Deepfake Detection

Yichen Jiang, Mohammed Talha Alam, Sohail Ahmed Khan et al. · University of Waterloo · MBZUAI +1 more

Adapts CLIP with prompt tuning and visual adapters to detect GAN and diffusion deepfakes across 25 diverse test sets

Output Integrity Attack vision
PDF
defense arXiv Dec 8, 2025 · Dec 2025

Towards Robust Protective Perturbation against DeepFake Face Swapping

Hengyang Yao, Lin Li, Ke Sun et al. · University of Birmingham · University of Oxford +2 more

Defends faces against deepfake swapping using RL-learned robust adversarial perturbations, outperforming EOT baselines by 26%

Output Integrity Attack vision generative
PDF
benchmark arXiv Dec 2, 2025 · Dec 2025

Defense That Attacks: How Robust Models Become Better Attackers

Mohamed Awad, Mahmoud Akrm, Walid Gomaa · MBZUAI · Egypt Japan University of Science and Technology +1 more

Adversarially trained models paradoxically become stronger attack surrogates, producing more transferable adversarial examples than standard models

Input Manipulation Attack vision
PDF Code
defense IJCNLP-AACL Oct 19, 2025 · Oct 2025

Online Learning Defense against Iterative Jailbreak Attacks via Prompt Optimization

Masahiro Kaneko, Zeerak Talat, Timothy Baldwin · MBZUAI · University of Edinburgh

Online learning defense dynamically counters iterative LLM jailbreaks via RL prompt optimization and gradient damping

Prompt Injection nlp
3 citations PDF
benchmark arXiv Oct 19, 2025 · Oct 2025

Bits Leaked per Query: Information-Theoretic Bounds on Adversarial Attacks against LLMs

Masahiro Kaneko, Timothy Baldwin · MBZUAI

Information-theoretic framework bounds LLM adversarial query complexity as log(1/ε)/I(Z;T), quantifying exact security cost of exposing logits or chain-of-thought

Prompt Injection Sensitive Information Disclosure nlp
PDF
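The summary's bound can be written out explicitly. A sketch of the stated scaling — symbol readings (Z as the adversary's per-query observation, T as the target secret, ε as the residual error, n as the query count) are my assumptions beyond what the one-line summary gives:

```latex
% Queries needed to drive error below \epsilon, when each query
% leaks at most I(Z;T) bits of mutual information about the target T:
n \;\gtrsim\; \frac{\log(1/\epsilon)}{I(Z;T)}
```

Read this way, exposing richer outputs such as logits or chain-of-thought raises I(Z;T) and so lowers the query cost of an attack — the "security cost" the entry says the framework quantifies.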
benchmark arXiv Oct 14, 2025 · Oct 2025

When Personalization Tricks Detectors: The Feature-Inversion Trap in Machine-Generated Text Detection

Lang Gao, Xuhui Li, Chenxi Wang et al. · MBZUAI · ByteDance +2 more

Benchmarks AI-text detectors on personalized LLM imitations, revealing a feature-inversion failure mode and proposing a diagnostic probe framework

Output Integrity Attack nlp
1 citation PDF Code
defense arXiv Oct 7, 2025 · Oct 2025

Human Texts Are Outliers: Detecting LLM-generated Texts via Out-of-distribution Detection

Cong Zeng, Shengkun Tang, Yuanzhou Chen et al. · MBZUAI · NEC Laboratories America +1 more

Reframes LLM-generated text detection as OOD detection, treating human texts as outliers, achieving 98.3% AUROC across multilingual and adversarial settings

Output Integrity Attack nlp
1 citation PDF
benchmark arXiv Sep 2, 2025 · Sep 2025

Speech DF Arena: A Leaderboard for Speech DeepFake Detection Models

Sandipana Dowerah, Atharva Kulkarni, Ajinkya Kulkarni et al. · Tallinn University of Technology · MBZUAI +4 more

Benchmarks 15 audio deepfake detectors across 14 datasets, exposing severe cross-domain generalization failures

Output Integrity Attack audio
PDF Code