ML Security Papers

Latest papers

39 papers

benchmark arXiv Apr 29, 2026 · 22d ago

Tatemae: Detecting Alignment Faking via Tool Selection in LLMs

Matteo Leonesi, Francesco Belardinelli, Flavio Corradini et al. · University of Camerino · Imperial College London

Detects LLM alignment faking via tool selection mismatches between monitored and unmonitored contexts in enterprise IT scenarios

Prompt Injection Excessive Agency nlp

PDF Code

attack arXiv Apr 22, 2026 · 29d ago

Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models

Yannis Belkhiter, Giulio Zizzo, Sergio Maffeis et al. · IBM Research Europe · Trinity College Dublin +1 more

Gradient-based adversarial attack that hijacks LLM function calling by inserting optimized tokens into function descriptions to force invocation of attacker-chosen tools

Input Manipulation Attack Insecure Plugin Design Excessive Agency nlp

PDF

defense arXiv Apr 20, 2026 · 4w ago

AgenTEE: Confidential LLM Agent Execution on Edge Devices

Sina Abdollahi, Mohammad M Maheri, Javad Forough et al. · Imperial College London · Dartmouth College

Secure LLM agent deployment system using Arm confidential VMs to isolate runtime, inference, and plugins on edge devices

AI Supply Chain Attacks Insecure Plugin Design Excessive Agency nlp

PDF

defense arXiv Apr 11, 2026 · 5w ago

Degradation-Consistent Paired Training for Robust AI-Generated Image Detection

Zongyou Yang, Yinghan Hou, Xiaokun Yang · University College London · Imperial College London +1 more

Paired consistency training that enforces robust AI-image detection under JPEG compression and degradations via explicit feature alignment

Output Integrity Attack vision generative

PDF

defense arXiv Mar 25, 2026 · 8w ago

Unleashing Vision-Language Semantics for Deepfake Video Detection

Jiawen Zhu, Yunqi Miao, Xueyi Zhang et al. · The University of Warwick · Nanyang Technological University +2 more

Deepfake detector leveraging CLIP's vision-language semantics with identity-aware prompting to achieve fine-grained forgery localization

Output Integrity Attack vision multimodal

PDF Code

attack arXiv Mar 25, 2026 · 8w ago

Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

Alexander Panfilov, Peter Romov, Igor Shilov et al. · MATS · ELLIS Institute Tübingen +3 more

AI agent autonomously discovers novel white-box jailbreak attacks outperforming 30+ existing methods with 100% ASR on target models

Input Manipulation Attack Prompt Injection Red-Team Agents Exploit Generation nlp

PDF Code

benchmark arXiv Mar 14, 2026 · 9w ago

Benchmarking the Energy Cost of Assurance in Neuromorphic Edge Robotics

Sylvester Kaczmarek · Imperial College London

Benchmarks neuromorphic defenses against gradient and temporal attacks, achieving 82% to 19% attack reduction while maintaining ultralow energy

Input Manipulation Attack vision

PDF

defense arXiv Mar 3, 2026 · 11w ago

IoUCert: Robustness Verification for Anchor-based Object Detectors

Benedikt Brückner, Alejandro J. Mercado, Yanghao Zhang et al. · Safe Intelligence · Imperial College London

Formal certified robustness verification framework for anchor-based object detectors using novel IBP over IoU metrics

Input Manipulation Attack vision

PDF

defense arXiv Feb 18, 2026 · Feb 2026

Exact Certification of Data-Poisoning Attacks Using Mixed-Integer Programming

Philip Sosnin, Jodie Knapp, Fraser Kennedy et al. · Imperial College London · The Alan Turing Institute

First sound-and-complete certification of data poisoning robustness via a single mixed-integer quadratic program encoding training dynamics

Data Poisoning Attack

PDF

defense arXiv Feb 10, 2026 · Feb 2026

Towards Poisoning Robustness Certification for Natural Language Generation

Mihnea Ghitu, Matthew Wicker · Imperial College London

Proposes TPA, the first certified defense against targeted data poisoning attacks for autoregressive LLMs using MILP-backed guarantees

Data Poisoning Attack Training Data Poisoning nlp

PDF

defense arXiv Feb 8, 2026 · Feb 2026

Selective Fine-Tuning for Targeted and Robust Concept Unlearning

Mansi, Avinash Kori, Francesca Toni et al. · Imperial College London

Defends text-to-image diffusion models via dynamic selective neuron unlearning robust against adversarial prompt bypasses of safety measures

Prompt Injection generative

PDF

attack arXiv Jan 30, 2026 · Jan 2026

Hide and Seek in Embedding Space: Geometry-based Steganography and Detection in Large Language Models

Charles Westphal, Keivan Navaie, Fernando E. Rosas · University College London · ML Alignment Theory Scholars +4 more

Maliciously LoRA-fine-tuned LLMs covertly exfiltrate prompt secrets via geometry-based steganography, detected via linear probes on internal activations

Model Poisoning Sensitive Information Disclosure nlp

PDF

attack arXiv Jan 29, 2026 · Jan 2026

Stealthy Poisoning Attacks Bypass Defenses in Regression Settings

Javier Carnerero-Cano, Luis Muñoz-González, Phillippa Spencer et al. · IBM Research Europe · Imperial College London +3 more

Stealthy bilevel-optimization poisoning attacks bypass regression defenses; BayesClean uses Bayesian uncertainty to detect them

Data Poisoning Attack tabular

PDF

defense arXiv Jan 28, 2026 · Jan 2026

How does information access affect LLM monitors' ability to detect sabotage?

Rauno Arike, Raja Mehta Moreno, Rohan Subramani et al. · Aether Research · Vector Institute +4 more

Proposes extract-and-evaluate monitoring to catch sabotaging LLM agents, finding that giving the monitor less information often yields better detection

Excessive Agency nlp

PDF

benchmark arXiv Jan 24, 2026 · Jan 2026

Unintended Memorization of Sensitive Information in Fine-Tuned Language Models

Marton Szep, Jorge Marin Ruiz, Georgios Kaissis et al. · Technical University of Munich · TUM University Hospital +1 more

Benchmarks PII extraction attacks and four defenses against unintended memorization in fine-tuned LLMs using black-box probes

Model Inversion Attack Sensitive Information Disclosure nlp

PDF Code

defense IEEE Transactions on Image Pro... Jan 23, 2026 · Jan 2026

StealthMark: Harmless and Stealthy Ownership Verification for Medical Segmentation via Uncertainty-Guided Backdoors

Qinkai Yu, Chong Zhang, Gaojie Jin et al. · University of Exeter · King Abdullah University of Science and Technology +6 more

Embeds backdoor-based watermarks in medical segmentation models to verify ownership under black-box API conditions

Model Theft vision

PDF Code

Annotating medical data for training AI models is often costly and limited due to the shortage of specialists with relevant clinical expertise. This challenge is further compounded by privacy and ethical concerns associated with sensitive patient information. As a result, medical segmentation models trained on private datasets constitute valuable intellectual property requiring robust protection mechanisms. Existing model protection techniques primarily focus on classification and generative tasks, while segmentation models, which are crucial to medical image analysis, remain largely underexplored. In this paper, we propose a novel, stealthy, and harmless method, StealthMark, for verifying the ownership of medical segmentation models under black-box conditions. Our approach subtly modulates model uncertainty without altering the final segmentation outputs, thereby preserving the model's performance. To enable ownership verification, we incorporate model-agnostic explanation methods, e.g., LIME, to extract feature attributions from the model outputs. Under specific triggering conditions, these explanations reveal a distinct and verifiable watermark. We further design the watermark as a QR code to facilitate robust and recognizable ownership claims. We conducted extensive experiments across four medical imaging datasets and five mainstream segmentation models. The results demonstrate the effectiveness, stealthiness, and harmlessness of our method with respect to the original model's segmentation performance. For example, when applied to the SAM model, StealthMark consistently achieved ASR above 95% across various datasets while maintaining less than a 1% drop in Dice and AUC scores, significantly outperforming backdoor-based watermarking methods and highlighting its strong potential for practical deployment. Our implementation code is made available at: https://github.com/Qinkaiyu/StealthMark.

transformer cnn University of Exeter · King Abdullah University of Science and Technology · Xi’an Jiaotong-Liverpool University +5 more

PDF arXiv DOI Code

attack arXiv Jan 19, 2026 · Jan 2026

Your Privacy Depends on Others: Collusion Vulnerabilities in Individual Differential Privacy

Johannes Kaiser, Alexander Ziller, Eleni Triantafillou et al. · Technical University of Munich · University of Potsdam +2 more

Exposes collusion vulnerability in iDP where adversaries manipulate others' privacy budgets to amplify membership inference attacks on targeted individuals

Membership Inference Attack

PDF

benchmark arXiv Jan 14, 2026 · Jan 2026

Blue Teaming Function-Calling Agents

Greta Dolcetti, Giulio Zizzo, Sergio Maffeis · Ca’ Foscari University of Venice · IBM Research +1 more

Benchmarks prompt injection and tool poisoning attacks against four open-source function-calling LLMs alongside eight defenses, finding none production-ready

Prompt Injection Insecure Plugin Design nlp

PDF

defense IEEE IoT-J Jan 12, 2026 · Jan 2026

Proof of Reasoning for Privacy Enhanced Federated Blockchain Learning at the Edge

James Calo, Benny Lo · Imperial College London

Blockchain consensus mechanism for federated learning defends against model inversion and Byzantine attacks via masked autoencoder data obfuscation

Model Inversion Attack Data Poisoning Attack federated-learning

PDF

benchmark arXiv Jan 6, 2026 · Jan 2026

Topology-Independent Robustness of the Weighted Mean under Label Poisoning Attacks in Heterogeneous Decentralized Learning

Jie Peng, Weiyu Li, Stefan Vlaski et al. · Sun Yat-Sen University · Harvard University +1 more

Theoretically proves weighted mean aggregator can outperform robust aggregators under label poisoning in decentralized learning, exposing topology-dependent weaknesses of robust aggregators

Data Poisoning Attack federated-learning

PDF
