Latest papers

20 papers
defense arXiv Mar 24, 2026 · 13d ago

Byzantine-Robust and Differentially Private Federated Optimization under Weaker Assumptions

Rustem Islamov, Grigory Malinovsky, Alexander Gaponov et al. · University of Basel · KAUST +1 more

Byzantine-robust federated learning with differential privacy, proving convergence via double momentum and clipping without bounded-gradient assumptions

Data Poisoning Attack federated-learning
PDF
attack arXiv Mar 16, 2026 · 21d ago

Visual Confused Deputy: Exploiting and Defending Perception Failures in Computer-Using Agents

Xunzhuo Liu, Bowei He, Xue Liu et al. · vLLM Semantic Router Project · MBZUAI +3 more

Introduces visual confused deputy attacks on GUI agents via screenshot manipulation and proposes dual-channel guardrails verifying both visual targets and textual reasoning

Input Manipulation Attack Output Integrity Attack Excessive Agency vision multimodal nlp
PDF Code
defense arXiv Mar 9, 2026 · 28d ago

SplitAgent: A Privacy-Preserving Distributed Architecture for Enterprise-Cloud Agent Collaboration

Jianshu She · MBZUAI

Defends enterprise LLM agents against data leakage by splitting sensitive handling from cloud reasoning with context-aware sanitization

Sensitive Information Disclosure Insecure Plugin Design nlp
PDF
benchmark arXiv Mar 6, 2026 · 4w ago

Do Compact SSL Backbones Matter for Audio Deepfake Detection? A Controlled Study with RAPTOR

Ajinkya Kulkarni, Sandipana Dowerah, Atharva Kulkarni et al. · Idiap Research Institute · Tallinn University of Technology +1 more

Controlled study benchmarking compact SSL backbones for audio deepfake detection with TTA-based uncertainty calibration

Output Integrity Attack audio
PDF
benchmark arXiv Mar 1, 2026 · 5w ago

JailNewsBench: Multi-Lingual and Regional Benchmark for Fake News Generation under Jailbreak Attacks

Masahiro Kaneko, Ayana Niwa, Timothy Baldwin · MBZUAI

Multilingual benchmark evaluating LLM jailbreak resilience for fake news generation across 34 regions, 22 languages, and 5 attack types

Prompt Injection nlp
PDF Code
tool arXiv Feb 23, 2026 · 6w ago

Pixels Don't Lie (But Your Detector Might): Bootstrapping MLLM-as-a-Judge for Trustworthy Deepfake Detection and Reasoning Supervision

Kartik Kuckreja, Parul Gupta, Muhammad Haris Khan et al. · MBZUAI · Monash University

Builds an MLLM judge that evaluates reasoning fidelity of deepfake detectors, outperforming 30x larger baselines at 96.2% accuracy

Output Integrity Attack vision multimodal
PDF Code
attack arXiv Feb 19, 2026 · 6w ago

Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting

Xiaohan Zhao, Zhaoyi Li, Yaxin Luo et al. · MBZUAI

Improved transfer-based black-box adversarial image attack on frontier LVLMs, boosting Claude-4.0 jailbreak rate from 8% to 30%

Input Manipulation Attack Prompt Injection vision multimodal nlp
PDF Code
defense arXiv Feb 15, 2026 · 7w ago

MCPShield: A Security Cognition Layer for Adaptive Trust Calibration in Model Context Protocol Agents

Zhenhong Zhou, Yuanhe Zhang, Hongwei Cai et al. · NTU · BUPT +3 more

Proposes MCPShield, a lifecycle-aware security layer defending LLM agents against malicious third-party MCP tool servers

Insecure Plugin Design nlp
PDF
defense arXiv Feb 11, 2026 · 7w ago

Collaborative Threshold Watermarking

Tameem Bakr, Anish Ambreth, Nils Lukas · MBZUAI

Threshold watermarking for federated learning: ≥t clients must collaborate to verify model ownership, preventing unilateral removal

Model Theft vision federated-learning
PDF
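The threshold property described above (no single client can verify, or remove, the watermark alone) is the same guarantee a (t, n) secret-sharing scheme provides. A minimal sketch, assuming a Shamir-style sharing of a verification key — an illustration only; the paper's actual construction may differ, and `watermark_key` is a hypothetical stand-in for whatever secret gates verification:

```python
import random

P = 2**127 - 1  # prime field; all arithmetic is mod P

def split(secret: int, n: int, t: int):
    """Split `secret` into n shares; any t of them reconstruct it."""
    coeffs = [secret] + [random.randrange(P) for _ in range(t - 1)]
    return [(x, sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P)
            for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers the secret."""
    secret = 0
    for j, (xj, yj) in enumerate(shares):
        num, den = 1, 1
        for m, (xm, _) in enumerate(shares):
            if m != j:
                num = num * (-xm) % P
                den = den * (xj - xm) % P
        secret = (secret + yj * num * pow(den, P - 2, P)) % P
    return secret

# Hypothetical verification secret, shared among n = 5 clients with threshold t = 3.
watermark_key = 123456789
shares = split(watermark_key, n=5, t=3)
assert reconstruct(shares[:3]) == watermark_key   # any 3 clients can verify
assert reconstruct(shares[:2]) != watermark_key   # 2 clients recover nothing useful
```

Fewer than t shares fix only a lower-degree polynomial, so the value at x = 0 is information-theoretically hidden — which is what blocks unilateral verification or removal.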
attack arXiv Feb 4, 2026 · 8w ago

Expert Selections In MoE Models Reveal (Almost) As Much As Text

Amir Nuriyev, Gabriel Kulp · MBZUAI · RAND +1 more

Reconstructs user input text from MoE routing decisions alone, achieving 91.2% token recovery via a transformer decoder

Model Inversion Attack Sensitive Information Disclosure nlp
PDF Code
attack arXiv Jan 13, 2026 · 11w ago

RAVEN: Erasing Invisible Watermarks via Novel View Synthesis

Fahad Shamshad, Nils Lukas, Karthik Nandakumar · MBZUAI · Michigan State University

Attacks invisible image watermarks by reformulating removal as novel view synthesis using zero-shot diffusion, defeating 15 schemes without detector access

Output Integrity Attack vision generative
PDF
benchmark arXiv Dec 23, 2025 · Dec 2025

AI Security Beyond Core Domains: Resume Screening as a Case Study of Adversarial Vulnerabilities in Specialized LLM Applications

Honglin Mu, Jinghao Liu, Kaiyang Wan et al. · Harbin Institute of Technology · MBZUAI +2 more

Benchmarks indirect prompt injection attacks on LLM resume screeners and proposes LoRA-based FIDS defense achieving 26% attack reduction

Prompt Injection nlp
1 citation PDF Code
defense arXiv Dec 19, 2025 · Dec 2025

AdaptPrompt: Parameter-Efficient Adaptation of VLMs for Generalizable Deepfake Detection

Yichen Jiang, Mohammed Talha Alam, Sohail Ahmed Khan et al. · University of Waterloo · MBZUAI +1 more

Adapts CLIP with prompt tuning and visual adapters to detect GAN and diffusion deepfakes across 25 diverse test sets

Output Integrity Attack vision
PDF
defense arXiv Dec 8, 2025 · Dec 2025

Towards Robust Protective Perturbation against DeepFake Face Swapping

Hengyang Yao, Lin Li, Ke Sun et al. · University of Birmingham · University of Oxford +2 more

Defends faces against deepfake swapping using RL-learned robust adversarial perturbations, outperforming EOT baselines by 26%

Output Integrity Attack vision generative
PDF
benchmark arXiv Dec 2, 2025 · Dec 2025

Defense That Attacks: How Robust Models Become Better Attackers

Mohamed Awad, Mahmoud Akrm, Walid Gomaa · MBZUAI · Egypt Japan University of Science and Technology +1 more

Adversarially trained models paradoxically become stronger attack surrogates, producing more transferable adversarial examples than standard models

Input Manipulation Attack vision
PDF Code
defense IJCNLP-AACL Oct 19, 2025 · Oct 2025

Online Learning Defense against Iterative Jailbreak Attacks via Prompt Optimization

Masahiro Kaneko, Zeerak Talat, Timothy Baldwin · MBZUAI · University of Edinburgh

Online learning defense dynamically counters iterative LLM jailbreaks via RL prompt optimization and gradient damping

Prompt Injection nlp
3 citations PDF
benchmark arXiv Oct 19, 2025 · Oct 2025

Bits Leaked per Query: Information-Theoretic Bounds on Adversarial Attacks against LLMs

Masahiro Kaneko, Timothy Baldwin · MBZUAI

Information-theoretic framework bounds LLM adversarial query complexity as log(1/ε)/I(Z;T), quantifying exact security cost of exposing logits or chain-of-thought

Prompt Injection Sensitive Information Disclosure nlp
PDF
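The summary's bound can be written out explicitly. A sketch of the stated scaling — symbol readings (Z as the adversary's per-query observation, T as the target secret, ε as the residual error, n as the query count) are my assumptions beyond what the one-line summary gives:

```latex
% Queries needed to drive error below \epsilon, when each query
% leaks at most I(Z;T) bits of mutual information about the target T:
n \;\gtrsim\; \frac{\log(1/\epsilon)}{I(Z;T)}
```

Read this way, exposing richer outputs such as logits or chain-of-thought raises I(Z;T) and so lowers the query cost of an attack — the "security cost" the entry says the framework quantifies.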
benchmark arXiv Oct 14, 2025 · Oct 2025

When Personalization Tricks Detectors: The Feature-Inversion Trap in Machine-Generated Text Detection

Lang Gao, Xuhui Li, Chenxi Wang et al. · MBZUAI · ByteDance +2 more

Benchmarks AI-text detectors on personalized LLM imitations, revealing a feature-inversion failure mode and proposing a diagnostic probe framework

Output Integrity Attack nlp
1 citation PDF Code
defense arXiv Oct 7, 2025 · Oct 2025

Human Texts Are Outliers: Detecting LLM-generated Texts via Out-of-distribution Detection

Cong Zeng, Shengkun Tang, Yuanzhou Chen et al. · MBZUAI · NEC Laboratories America +1 more

Reframes LLM-generated text detection as OOD detection, treating human texts as outliers, achieving 98.3% AUROC across multilingual and adversarial settings

Output Integrity Attack nlp
1 citation PDF
benchmark arXiv Sep 2, 2025 · Sep 2025

Speech DF Arena: A Leaderboard for Speech DeepFake Detection Models

Sandipana Dowerah, Atharva Kulkarni, Ajinkya Kulkarni et al. · Tallinn University of Technology · MBZUAI +4 more

Benchmarks 15 audio deepfake detectors across 14 datasets, exposing severe cross-domain generalization failures

Output Integrity Attack audio
PDF Code