ML Security Papers

Latest papers

24 papers

defense arXiv Mar 24, 2026 · 15d ago

Robust Safety Monitoring of Language Models via Activation Watermarking

Toluwani Aremu, Daniil Ognev, Samuele Poppi et al. · Mohamed bin Zayed University of Artificial Intelligence

Activation watermarking defense that detects adaptive jailbreak attacks on LLM safety monitors with 52% improvement over baselines

Prompt Injection nlp

PDF

defense arXiv Mar 19, 2026 · 20d ago

Functional Subspace Watermarking for Large Language Models

Zikang Ding, Junhao Li, Suling Wu et al. · University of Electronic Science and Technology of China · Mohamed bin Zayed University of Artificial Intelligence +1 more

Embeds ownership watermarks in a low-dimensional functional subspace of LLM weights, surviving fine-tuning, quantization, and distillation attacks

Model Theft Model Theft nlp

PDF

defense arXiv Mar 12, 2026 · 27d ago

Byzantine-Robust Optimization under $(L_0, L_1)$-Smoothness

Arman Bolatov, Samuel Horváth, Martin Takáč et al. · Mohamed bin Zayed University of Artificial Intelligence

Byzantine-robust federated learning algorithm using normalized momentum to defend against malicious worker updates under gradient heterogeneity

Data Poisoning Attack federated-learningvisionnlp

PDF

defense arXiv Mar 9, 2026 · 4w ago

Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images

Qishun Yang, Shu Yang, Lijie Hu et al. · King Abdullah University of Science and Technology · China University of Petroleum-Beijing +1 more

Defends VLMs against visual jailbreaks via label-free fine-tuning on neutral threat-image tasks to shape safety-oriented personas

Prompt Injection visionmultimodalnlp

PDF

benchmark arXiv Feb 9, 2026 · 8w ago

Overview of PAN 2026: Voight-Kampff Generative AI Detection, Text Watermarking, Multi-Author Writing Style Analysis, Generative Plagiarism Detection, and Reasoning Trajectory Detection

Janek Bevendorff, Maik Fröbe, André Greiner-Petter et al. · Bauhaus-Universität Weimar · Friedrich Schiller University Jena +8 more

Benchmark workshop organizing five shared tasks for AI-text detection, watermarking robustness, and LLM reasoning safety evaluation

Output Integrity Attack Prompt Injection nlpgenerative

PDF

benchmark arXiv Feb 2, 2026 · 9w ago

AICD Bench: A Challenging Benchmark for AI-Generated Code Detection

Daniil Orel, Dilshod Azizov, Indraneil Paul et al. · Mohamed bin Zayed University of Artificial Intelligence · TU Darmstadt +1 more

Large-scale benchmark revealing AI-generated code detectors fail severely under distribution shift and adversarial conditions

Output Integrity Attack nlp

PDF Code

attack arXiv Jan 11, 2026 · 12w ago

Overcoming the Retrieval Barrier: Indirect Prompt Injection in the Wild for LLM Systems

Hongyan Chang, Ergute Bao, Xinjian Luo et al. · Mohamed bin Zayed University of Artificial Intelligence

Black-box adversarial document injection guarantees retrieval of malicious IPI content in RAG systems, enabling SSH key exfiltration via GPT-4o with 80%+ success

Input Manipulation Attack Prompt Injection nlp

2 citations PDF

attack arXiv Jan 11, 2026 · 12w ago

Paraphrasing Adversarial Attack on LLM-as-a-Reviewer

Masahiro Kaneko · Mohamed bin Zayed University of Artificial Intelligence

Black-box paraphrasing attack inflates LLM-as-a-Reviewer scores without altering manuscript claims or injecting hidden instructions

Prompt Injection nlp

1 citations PDF

attack arXiv Jan 4, 2026 · Jan 2026

Lying with Truths: Open-Channel Multi-Agent Collusion for Belief Manipulation via Generative Montage

Jinwei Hu, Xinmiao Huang, Youcheng Sun et al. · University of Liverpool · Mohamed bin Zayed University of Artificial Intelligence

Colluding LLM agents manipulate victim agents into false beliefs by coordinating truthful but deceptive evidence fragments across public channels

Prompt Injection nlp

PDF Code

defense arXiv Dec 17, 2025 · Dec 2025

Robust and Calibrated Detection of Authentic Multimedia Content

Sarim Hashmi, Abdelrahman Elsayed, Mohammed Talha Alam et al. · Mohamed bin Zayed University of Artificial Intelligence

Resynthesis-based deepfake detector with calibrated low false-positive rates and robustness against adaptive evasion adversaries across modalities

Output Integrity Attack visionmultimodalgenerative

1 citations PDF

defense arXiv Dec 12, 2025 · Dec 2025

SPDMark: Selective Parameter Displacement for Robust Video Watermarking

Samar Fares, Nurbek Tastan, Karthik Nandakumar · Mohamed bin Zayed University of Artificial Intelligence · Michigan State University

In-generation video watermarking via LoRA parameter displacement to track provenance of diffusion-generated videos

Output Integrity Attack visiongenerative

1 citations PDF

defense arXiv Dec 8, 2025 · Dec 2025

AdLift: Lifting Adversarial Perturbations to Safeguard 3D Gaussian Splatting Assets Against Instruction-Driven Editing

Ziming Hong, Tianyu Huang, Runnan Chen et al. · The University of Sydney · University of Technology Sydney +3 more

Defends 3D Gaussian Splatting assets from AI editing by lifting adversarial perturbations from 2D image space into 3D Gaussian parameters

Input Manipulation Attack visiongenerative

4 citations PDF Code

defense arXiv Dec 7, 2025 · Dec 2025

RDSplat: Robust Watermarking Against Diffusion Editing for 3D Gaussian Splatting

Longjie Zhao, Ziming Hong, Zhenyang Ren et al. · The University of Sydney · The University of Melbourne +1 more

Embeds robust watermarks into 3DGS scenes resistant to diffusion-based editing via low-frequency Gaussian targeting and adversarial training

Output Integrity Attack visiongenerative

1 citations 1 influentialPDF

defense arXiv Dec 4, 2025 · Dec 2025

DAMASHA: Detecting AI in Mixed Adversarial Texts via Segmentation with Human-interpretable Attribution

L. D. M. S. Sai Teja, N. Siva Gopala Krishna, Ufaq Khan et al. · National Institute of Technology Silchar · BML Munjal University +1 more

Defends mixed-authorship AI text detectors against adversarial evasion using Info-Mask segmentation and interpretable stylometric attribution overlays

Output Integrity Attack Input Manipulation Attack nlp

PDF Code

defense arXiv Nov 26, 2025 · Nov 2025

Towards Reasoning-Preserving Unlearning in Multimodal Large Language Models

Hongji Li, Junchi yao, Manjiang Yu et al. · Mohamed bin Zayed University of Artificial Intelligence · University of Queensland +1 more

Discovers that CoT reasoning leaks sensitive memorized data after unlearning; proposes activation-steering defense for multimodal LLMs

Sensitive Information Disclosure multimodalnlp

1 citations PDF

benchmark arXiv Nov 24, 2025 · Nov 2025

SPQR: A Standardized Benchmark for Modern Safety Alignment Methods in Text-to-Image Diffusion Models

Mohammed Talha Alam, Nada Saadi, Fahad Shamshad et al. · Mohamed bin Zayed University of Artificial Intelligence · Michigan State University +1 more

Benchmarks T2I diffusion safety alignment across safety, utility, quality, and robustness after benign LoRA fine-tuning

Output Integrity Attack Transfer Learning Attack visiongenerative

PDF

defense arXiv Nov 3, 2025 · Nov 2025

Detecting Generated Images by Fitting Natural Image Distributions

Yonggang Zhang, Jun Nie, Xinmei Tian et al. · The Hong Kong University of Science and Technology · Hong Kong Baptist University +4 more

Proposes ConV, a generated-image detector exploiting data manifold geometry requiring no generated training samples

Output Integrity Attack visiongenerative

2 citations PDF Code

benchmark arXiv Oct 22, 2025 · Oct 2025

Machine Text Detectors are Membership Inference Attacks

Ryuto Koike, Liam Dugan, Masahiro Kaneko et al. · Institute of Science Tokyo · University of Pennsylvania +1 more

Proves MIAs and machine text detectors share the same optimal metric, demonstrating strong cross-task transferability with a unified evaluation suite.

Membership Inference Attack Output Integrity Attack nlp

1 citations 1 influentialPDF Code

attack arXiv Oct 17, 2025 · Oct 2025

Constrained Adversarial Perturbation

Virendra Nishad, Bhaskar Mukhoty, Hilal AlQuabeh et al. · Indian Institute of Technology Kanpur · Indian Institute of Technology Delhi +2 more

Proposes CAP, constraint-aware universal adversarial perturbations for tabular domains via augmented Lagrangian min-max optimization

Input Manipulation Attack tabular

PDF

attack arXiv Oct 15, 2025 · Oct 2025

Personal Attribute Leakage in Federated Speech Models

Hamdan Al-Ali, Ali Reza Ghavamipour, Tommaso Caselli et al. · Mohamed bin Zayed University of Artificial Intelligence · Maastricht University +2 more

Infers private personal attributes from federated ASR model weight differentials using shadow models and centroid classification

Model Inversion Attack audiofederated-learning

PDF

Loading more papers…

Latest papers

Robust Safety Monitoring of Language Models via Activation Watermarking

Functional Subspace Watermarking for Large Language Models

Byzantine-Robust Optimization under $(L_0, L_1)$-Smoothness

Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images

Overview of PAN 2026: Voight-Kampff Generative AI Detection, Text Watermarking, Multi-Author Writing Style Analysis, Generative Plagiarism Detection, and Reasoning Trajectory Detection

AICD Bench: A Challenging Benchmark for AI-Generated Code Detection

Overcoming the Retrieval Barrier: Indirect Prompt Injection in the Wild for LLM Systems

Paraphrasing Adversarial Attack on LLM-as-a-Reviewer

Lying with Truths: Open-Channel Multi-Agent Collusion for Belief Manipulation via Generative Montage

Robust and Calibrated Detection of Authentic Multimedia Content

SPDMark: Selective Parameter Displacement for Robust Video Watermarking

AdLift: Lifting Adversarial Perturbations to Safeguard 3D Gaussian Splatting Assets Against Instruction-Driven Editing

RDSplat: Robust Watermarking Against Diffusion Editing for 3D Gaussian Splatting

DAMASHA: Detecting AI in Mixed Adversarial Texts via Segmentation with Human-interpretable Attribution

Towards Reasoning-Preserving Unlearning in Multimodal Large Language Models

SPQR: A Standardized Benchmark for Modern Safety Alignment Methods in Text-to-Image Diffusion Models

Detecting Generated Images by Fitting Natural Image Distributions

Machine Text Detectors are Membership Inference Attacks

Constrained Adversarial Perturbation

Personal Attribute Leakage in Federated Speech Models

Filters

Time Period

Paper Type

OWASP ML Top 10

OWASP LLM Top 10

Institution

Venue