Latest papers

39 papers
benchmark arXiv Apr 29, 2026 · 22d ago

Tatemae: Detecting Alignment Faking via Tool Selection in LLMs

Matteo Leonesi, Francesco Belardinelli, Flavio Corradini et al. · University of Camerino · Imperial College London

Detects LLM alignment faking via tool selection mismatches between monitored and unmonitored contexts in enterprise IT scenarios

Prompt Injection Excessive Agency nlp
PDF Code
attack arXiv Apr 22, 2026 · 29d ago

Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models

Yannis Belkhiter, Giulio Zizzo, Sergio Maffeis et al. · IBM Research Europe · Trinity College Dublin +1 more

Gradient-based adversarial attack that hijacks LLM function calling by inserting optimized tokens into function descriptions to force invocation of attacker-chosen tools

Input Manipulation Attack Insecure Plugin Design Excessive Agency nlp
PDF
defense arXiv Apr 20, 2026 · 4w ago

AgenTEE: Confidential LLM Agent Execution on Edge Devices

Sina Abdollahi, Mohammad M Maheri, Javad Forough et al. · Imperial College London · Dartmouth College

Secure LLM agent deployment system using Arm confidential VMs to isolate runtime, inference, and plugins on edge devices

AI Supply Chain Attacks Insecure Plugin Design Excessive Agency nlp
PDF
defense arXiv Apr 11, 2026 · 5w ago

Degradation-Consistent Paired Training for Robust AI-Generated Image Detection

Zongyou Yang, Yinghan Hou, Xiaokun Yang · University College London · Imperial College London +1 more

Paired consistency training that enforces robust AI-image detection under JPEG compression and degradations via explicit feature alignment

Output Integrity Attack vision generative
PDF
defense arXiv Mar 25, 2026 · 8w ago

Unleashing Vision-Language Semantics for Deepfake Video Detection

Jiawen Zhu, Yunqi Miao, Xueyi Zhang et al. · The University of Warwick · Nanyang Technological University +2 more

Deepfake detector leveraging CLIP's vision-language semantics with identity-aware prompting to achieve fine-grained forgery localization

Output Integrity Attack vision multimodal
PDF Code
attack arXiv Mar 25, 2026 · 8w ago

Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

Alexander Panfilov, Peter Romov, Igor Shilov et al. · MATS · ELLIS Institute Tübingen +3 more

AI agent autonomously discovers novel white-box jailbreak attacks outperforming 30+ existing methods with 100% ASR on target models

Input Manipulation Attack Prompt Injection Red-Team Agents Exploit Generation nlp
PDF Code
benchmark arXiv Mar 14, 2026 · 9w ago

Benchmarking the Energy Cost of Assurance in Neuromorphic Edge Robotics

Sylvester Kaczmarek · Imperial College London

Benchmarks neuromorphic defenses against gradient and temporal attacks, achieving 82% to 19% attack reduction while maintaining ultralow energy

Input Manipulation Attack vision
PDF
defense arXiv Mar 3, 2026 · 11w ago

IoUCert: Robustness Verification for Anchor-based Object Detectors

Benedikt Brückner, Alejandro J. Mercado, Yanghao Zhang et al. · Safe Intelligence · Imperial College London

Formal certified robustness verification framework for anchor-based object detectors using novel IBP over IoU metrics

Input Manipulation Attack vision
PDF
defense arXiv Feb 18, 2026 · Feb 2026

Exact Certification of Data-Poisoning Attacks Using Mixed-Integer Programming

Philip Sosnin, Jodie Knapp, Fraser Kennedy et al. · Imperial College London · The Alan Turing Institute

First sound-and-complete certification of data poisoning robustness via a single mixed-integer quadratic program encoding training dynamics

Data Poisoning Attack
PDF
defense arXiv Feb 10, 2026 · Feb 2026

Towards Poisoning Robustness Certification for Natural Language Generation

Mihnea Ghitu, Matthew Wicker · Imperial College London

Proposes TPA, the first certified defense against targeted data poisoning attacks for autoregressive LLMs using MILP-backed guarantees

Data Poisoning Attack Training Data Poisoning nlp
PDF
defense arXiv Feb 8, 2026 · Feb 2026

Selective Fine-Tuning for Targeted and Robust Concept Unlearning

Mansi, Avinash Kori, Francesca Toni et al. · Imperial College London

Defends text-to-image diffusion models via dynamic, selective neuron unlearning that stays robust to adversarial prompts designed to bypass safety measures

Prompt Injection generative
PDF
attack arXiv Jan 30, 2026 · Jan 2026

Hide and Seek in Embedding Space: Geometry-based Steganography and Detection in Large Language Models

Charles Westphal, Keivan Navaie, Fernando E. Rosas · University College London · ML Alignment Theory Scholars +4 more

Maliciously LoRA-fine-tuned LLMs covertly exfiltrate prompt secrets via geometry-based steganography, detected via linear probes on internal activations

Model Poisoning Sensitive Information Disclosure nlp
PDF
attack arXiv Jan 29, 2026 · Jan 2026

Stealthy Poisoning Attacks Bypass Defenses in Regression Settings

Javier Carnerero-Cano, Luis Muñoz-González, Phillippa Spencer et al. · IBM Research Europe · Imperial College London +3 more

Stealthy bilevel-optimization poisoning attacks bypass regression defenses; BayesClean uses Bayesian uncertainty to detect them

Data Poisoning Attack tabular
PDF
defense arXiv Jan 28, 2026 · Jan 2026

How does information access affect LLM monitors' ability to detect sabotage?

Rauno Arike, Raja Mehta Moreno, Rohan Subramani et al. · Aether Research · Vector Institute +4 more

Proposes extract-and-evaluate monitoring to catch sabotaging LLM agents, finding that giving the monitor less information often yields better detection

Excessive Agency nlp
PDF
benchmark arXiv Jan 24, 2026 · Jan 2026

Unintended Memorization of Sensitive Information in Fine-Tuned Language Models

Marton Szep, Jorge Marin Ruiz, Georgios Kaissis et al. · Technical University of Munich · TUM University Hospital +1 more

Benchmarks PII extraction attacks and four defenses against unintended memorization in fine-tuned LLMs using black-box probes

Model Inversion Attack Sensitive Information Disclosure nlp
PDF Code
defense IEEE Transactions on Image Pro... Jan 23, 2026 · Jan 2026

StealthMark: Harmless and Stealthy Ownership Verification for Medical Segmentation via Uncertainty-Guided Backdoors

Qinkai Yu, Chong Zhang, Gaojie Jin et al. · University of Exeter · King Abdullah University of Science and Technology +6 more

Embeds backdoor-based watermarks in medical segmentation models to verify ownership under black-box API conditions

Model Theft vision
PDF Code
attack arXiv Jan 19, 2026 · Jan 2026

Your Privacy Depends on Others: Collusion Vulnerabilities in Individual Differential Privacy

Johannes Kaiser, Alexander Ziller, Eleni Triantafillou et al. · Technical University of Munich · University of Potsdam +2 more

Exposes collusion vulnerability in iDP where adversaries manipulate others' privacy budgets to amplify membership inference attacks on targeted individuals

Membership Inference Attack
PDF
benchmark arXiv Jan 14, 2026 · Jan 2026

Blue Teaming Function-Calling Agents

Greta Dolcetti, Giulio Zizzo, Sergio Maffeis · Ca’ Foscari University of Venice · IBM Research +1 more

Benchmarks prompt injection and tool poisoning attacks against four open-source function-calling LLMs alongside eight defenses, finding none production-ready

Prompt Injection Insecure Plugin Design nlp
PDF
defense IEEE IoT-J Jan 12, 2026 · Jan 2026

Proof of Reasoning for Privacy Enhanced Federated Blockchain Learning at the Edge

James Calo, Benny Lo · Imperial College London

Blockchain consensus mechanism for federated learning defends against model inversion and Byzantine attacks via masked autoencoder data obfuscation

Model Inversion Attack Data Poisoning Attack federated-learning
PDF
benchmark arXiv Jan 6, 2026 · Jan 2026

Topology-Independent Robustness of the Weighted Mean under Label Poisoning Attacks in Heterogeneous Decentralized Learning

Jie Peng, Weiyu Li, Stefan Vlaski et al. · Sun Yat-Sen University · Harvard University +1 more

Proves that the weighted mean aggregator can outperform robust aggregators under label poisoning in decentralized learning, exposing topology-dependent weaknesses of robust aggregators

Data Poisoning Attack federated-learning
PDF