Latest papers

35 papers
defense arXiv Mar 25, 2026 · 12d ago

Unleashing Vision-Language Semantics for Deepfake Video Detection

Jiawen Zhu, Yunqi Miao, Xueyi Zhang et al. · The University of Warwick · Nanyang Technological University +2 more

Deepfake detector leveraging CLIP's vision-language semantics with identity-aware prompting to achieve fine-grained forgery localization

Output Integrity Attack vision multimodal
PDF Code
attack arXiv Mar 25, 2026 · 12d ago

Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

Alexander Panfilov, Peter Romov, Igor Shilov et al. · MATS · ELLIS Institute Tübingen +3 more

AI agent autonomously discovers novel white-box jailbreak attacks that outperform 30+ existing methods, reaching a 100% attack success rate (ASR) on target models

Input Manipulation Attack Prompt Injection nlp
PDF Code
benchmark arXiv Mar 14, 2026 · 23d ago

Benchmarking the Energy Cost of Assurance in Neuromorphic Edge Robotics

Sylvester Kaczmarek · Imperial College London

Benchmarks neuromorphic defenses against gradient and temporal attacks, achieving attack reductions between 82% and 19% depending on the attack while maintaining ultra-low energy consumption

Input Manipulation Attack vision
PDF
defense arXiv Mar 3, 2026 · 4w ago

IoUCert: Robustness Verification for Anchor-based Object Detectors

Benedikt Brückner, Alejandro J. Mercado, Yanghao Zhang et al. · Safe Intelligence · Imperial College London

Formal certified robustness verification framework for anchor-based object detectors using novel IBP over IoU metrics (see bound sketch below)

Input Manipulation Attack vision
PDF
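For intuition, a minimal Python sketch of interval bound propagation over IoU: given interval bounds on a predicted box's coordinates (as bound propagation through the detector would produce), it computes a sound lower bound on IoU against a fixed ground-truth box. The decoupled interval arithmetic here is deliberately simple; the paper's relaxation is presumably tighter.

```python
# Minimal IBP-over-IoU sketch: a sound (if loose) lower bound on IoU when
# each predicted-box coordinate is only known up to an interval.
def interval_iou_lower_bound(pred_lo, pred_hi, gt):
    """pred_lo, pred_hi: (x1, y1, x2, y2) coordinate-wise interval bounds.
    gt: fixed ground-truth box (x1, y1, x2, y2)."""
    # Smallest possible intersection: shrink the predicted box, i.e. take
    # x1/y1 at their upper bounds and x2/y2 at their lower bounds.
    iw = max(0.0, min(pred_lo[2], gt[2]) - max(pred_hi[0], gt[0]))
    ih = max(0.0, min(pred_lo[3], gt[3]) - max(pred_hi[1], gt[1]))
    inter_lo = iw * ih
    # Largest possible predicted-box area: expand the box.
    area_pred_hi = max(0.0, pred_hi[2] - pred_lo[0]) * max(0.0, pred_hi[3] - pred_lo[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    # IoU = I / (A + B - I) is increasing in I and decreasing in A, so
    # plugging in inter_lo and area_pred_hi yields a sound lower bound.
    union_hi = area_pred_hi + area_gt - inter_lo
    return inter_lo / union_hi if union_hi > 0 else 0.0

# Example: every coordinate certified to lie within +/-2 px of (10, 10, 50, 50).
print(interval_iou_lower_bound((8, 8, 48, 48), (12, 12, 52, 52), (10, 10, 50, 50)))
```

Any prediction whose certified IoU lower bound stays above the matching threshold is verifiably robust for that ground-truth box.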
defense arXiv Feb 18, 2026 · 6w ago

Exact Certification of Data-Poisoning Attacks Using Mixed-Integer Programming

Philip Sosnin, Jodie Knapp, Fraser Kennedy et al. · Imperial College London · The Alan Turing Institute

First sound-and-complete certification of data-poisoning robustness via a single mixed-integer quadratic program encoding the training dynamics (see toy sketch below)

Data Poisoning Attack
PDF
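To make "sound and complete" concrete, a toy sketch that certifies a least-squares classifier against label flips by exhaustive enumeration. The paper instead encodes the same question as one MIQP so it scales; the brute-force search below is only a conceptual stand-in with illustrative names.

```python
# Toy exact poisoning certification: does ANY flip of up to `max_flips`
# training labels change the model's prediction on a test point?
from itertools import combinations
import numpy as np

def train_predict(X, y, x_test):
    """Closed-form least-squares linear classifier; the sign is the label."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sign(x_test @ w)

def certify(X, y, x_test, max_flips=1):
    clean_pred = train_predict(X, y, x_test)
    for r in range(1, max_flips + 1):
        for idx in combinations(range(len(y)), r):
            y_adv = y.copy()
            y_adv[list(idx)] *= -1          # flip labels in {-1, +1}
            if train_predict(X, y_adv, x_test) != clean_pred:
                return False                # found a successful poisoning
    return True                             # no admissible poisoning succeeds

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]))
print("certified robust to 1 label flip:", certify(X, y, X[0]))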
defense arXiv Feb 10, 2026 · 7w ago

Towards Poisoning Robustness Certification for Natural Language Generation

Mihnea Ghitu, Matthew Wicker · Imperial College London

Proposes TPA, the first certified defense against targeted data poisoning attacks for autoregressive LLMs using MILP-backed guarantees

Data Poisoning Attack Training Data Poisoning nlp
PDF
defense arXiv Feb 8, 2026 · 8w ago

Selective Fine-Tuning for Targeted and Robust Concept Unlearning

Mansi, Avinash Kori, Francesca Toni et al. · Imperial College London

Defends text-to-image diffusion models via dynamic selective neuron unlearning that remains robust to adversarial prompts designed to bypass safety measures

Prompt Injection generative
PDF
attack arXiv Jan 30, 2026 · 9w ago

Hide and Seek in Embedding Space: Geometry-based Steganography and Detection in Large Language Models

Charles Westphal, Keivan Navaie, Fernando E. Rosas · University College London · ML Alignment Theory Scholars +4 more

Maliciously LoRA-fine-tuned LLMs covertly exfiltrate prompt secrets through geometry-based steganography; linear probes on internal activations detect the manipulation (see probe sketch below)

Model Poisoning Sensitive Information Disclosure nlp
PDF
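A minimal sketch of the detection side, assuming white-box access to hidden activations: fit a logistic-regression probe to separate benign from steganographic runs. The activations and the "stego direction" below are simulated stand-ins for the model's real residual-stream features.

```python
# Linear-probe detection sketch on (simulated) internal activations.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64
stego_direction = rng.normal(size=d)              # hypothetical embedding shift
benign = rng.normal(size=(200, d))
stego = rng.normal(size=(200, d)) + 0.8 * stego_direction

X = np.vstack([benign, stego])
y = np.array([0] * 200 + [1] * 200)
probe = LogisticRegression(max_iter=1000).fit(X, y)
print(f"probe accuracy: {probe.score(X, y):.2f}")
```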
attack arXiv Jan 29, 2026 · 9w ago

Stealthy Poisoning Attacks Bypass Defenses in Regression Settings

Javier Carnerero-Cano, Luis Muñoz-González, Phillippa Spencer et al. · IBM Research Europe · Imperial College London +3 more

Stealthy bilevel-optimization poisoning attacks bypass regression defenses; BayesClean uses Bayesian uncertainty to detect them

Data Poisoning Attack tabular
PDF
defense arXiv Jan 28, 2026 · 9w ago

How does information access affect LLM monitors' ability to detect sabotage?

Rauno Arike, Raja Mehta Moreno, Rohan Subramani et al. · Aether Research · Vector Institute +4 more

Proposes extract-and-evaluate monitoring to catch sabotaging LLM agents, finding that giving the monitor less information often yields better detection

Excessive Agency nlp
PDF
benchmark arXiv Jan 24, 2026 · 10w ago

Unintended Memorization of Sensitive Information in Fine-Tuned Language Models

Marton Szep, Jorge Marin Ruiz, Georgios Kaissis et al. · Technical University of Munich · TUM University Hospital +1 more

Benchmarks PII extraction attacks and four defenses against unintended memorization in fine-tuned LLMs using black-box probes (see probe sketch below)

Model Inversion Attack Sensitive Information Disclosure nlp
PDF Code
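In the spirit of the benchmark's black-box probes, a minimal sketch: prompt the fine-tuned model with the context that preceded a PII string in training and count verbatim completions. `generate` and the records are hypothetical stand-ins, not the paper's API.

```python
# Black-box memorization probe sketch: does the model complete a known
# prefix with the secret it saw during fine-tuning?
def extraction_rate(generate, records):
    """records: (prefix, secret) pairs; `generate` is any text-generation call."""
    leaked = sum(secret in generate(prefix) for prefix, secret in records)
    return leaked / len(records)

# Toy stand-in for a model that memorized one of two records.
def fake_generate(prefix):
    memorized = {"Patient name:": " John Doe, DOB 1980-01-01"}
    return memorized.get(prefix, " [no memorized continuation]")

records = [("Patient name:", "John Doe"), ("SSN:", "123-45-6789")]
print(f"extraction rate: {extraction_rate(fake_generate, records):.0%}")
```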
defense IEEE Transactions on Image Processing Jan 23, 2026 · 10w ago

StealthMark: Harmless and Stealthy Ownership Verification for Medical Segmentation via Uncertainty-Guided Backdoors

Qinkai Yu, Chong Zhang, Gaojie Jin et al. · University of Exeter · King Abdullah University of Science and Technology +6 more

Embeds backdoor-based watermarks in medical segmentation models to verify ownership under black-box API conditions (see verification sketch below)

Model Theft vision
PDF Code
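A conceptual sketch, with hypothetical names, of how backdoor-based ownership verification works through a black-box API: query the suspect model with secret triggered inputs and test whether the watermark behavior appears far more often than chance.

```python
# Black-box ownership-verification sketch for a backdoor-watermarked model.
def verify_ownership(api, triggers, shows_watermark, threshold=0.9):
    """api: black-box callable input -> output; shows_watermark tests one output."""
    hits = sum(shows_watermark(api(x)) for x in triggers)
    return hits / len(triggers) >= threshold

# Toy stand-in: a suspect model that reproduces the watermark on triggers.
def suspect_api(x):
    return "owner-signature" if x.get("trigger") else "normal-mask"

triggers = [{"trigger": True, "id": i} for i in range(20)]
print("ownership verified:", verify_ownership(suspect_api, triggers,
                                              lambda out: out == "owner-signature"))
```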
attack arXiv Jan 19, 2026 · 11w ago

Your Privacy Depends on Others: Collusion Vulnerabilities in Individual Differential Privacy

Johannes Kaiser, Alexander Ziller, Eleni Triantafillou et al. · Technical University of Munich · University of Potsdam +2 more

Exposes a collusion vulnerability in iDP where adversaries manipulate others' privacy budgets to amplify membership inference attacks on targeted individuals (see attack sketch below)

Membership Inference Attack
PDF
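For context, the sketch below shows the generic loss-threshold membership-inference primitive that the paper's colluding adversaries amplify by manipulating others' privacy budgets; all losses here are simulated.

```python
# Loss-threshold membership inference: flag records whose loss is suspiciously low.
import numpy as np

def mi_attack(loss_of, candidates, threshold):
    return [loss_of(x) < threshold for x in candidates]

rng = np.random.default_rng(0)
member_losses = rng.normal(0.2, 0.05, 100)      # simulated training-record losses
non_member_losses = rng.normal(1.0, 0.30, 100)  # simulated unseen-record losses
losses = np.concatenate([member_losses, non_member_losses])
guesses = mi_attack(lambda i: losses[i], range(200), threshold=0.5)
accuracy = (sum(guesses[:100]) + 100 - sum(guesses[100:])) / 200
print(f"attack accuracy: {accuracy:.2f}")
```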
benchmark arXiv Jan 14, 2026 · 11w ago

Blue Teaming Function-Calling Agents

Greta Dolcetti, Giulio Zizzo, Sergio Maffeis · Ca’ Foscari University of Venice · IBM Research +1 more

Benchmarks prompt injection and tool poisoning attacks against four open-source function-calling LLMs alongside eight defenses, finding none production-ready

Prompt Injection Insecure Plugin Design nlp
PDF
defense IEEE IoT-J Jan 12, 2026 · 12w ago

Proof of Reasoning for Privacy Enhanced Federated Blockchain Learning at the Edge

James Calo, Benny Lo · Imperial College London

Blockchain consensus mechanism for federated learning defends against model inversion and Byzantine attacks via masked autoencoder data obfuscation

Model Inversion Attack Data Poisoning Attack federated-learning
PDF
benchmark arXiv Jan 6, 2026 · Jan 2026

Topology-Independent Robustness of the Weighted Mean under Label Poisoning Attacks in Heterogeneous Decentralized Learning

Jie Peng, Weiyu Li, Stefan Vlaski et al. · Sun Yat-Sen University · Harvard University +1 more

Theoretically proves that the weighted mean aggregator can outperform robust aggregators under label poisoning in decentralized learning, exposing their topology-dependent weaknesses (see toy sketch below)

Data Poisoning Attack federated-learning
PDF
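A toy illustration of the tension the paper formalizes, with purely illustrative numbers: under heterogeneous honest gradients, coordinate-wise trimming discards an honest minority cluster along with the poisoners, while a suitably weighted mean keeps it.

```python
# Weighted mean vs. coordinate-wise trimmed mean under label poisoning.
import numpy as np

def weighted_mean(grads, weights):
    w = np.asarray(weights, dtype=float)
    return (w[:, None] * grads).sum(axis=0) / w.sum()

def trimmed_mean(grads, k):
    """Drop the k smallest and k largest entries per coordinate, then average."""
    s = np.sort(grads, axis=0)
    return s[k:len(grads) - k].mean(axis=0)

rng = np.random.default_rng(1)
cluster_a = rng.normal(1.0, 0.05, size=(6, 3))  # majority honest cluster
cluster_b = rng.normal(4.0, 0.05, size=(2, 3))  # heterogeneous honest minority
poisoned = -5.0 * np.ones((2, 3))               # label-poisoned workers
grads = np.vstack([cluster_a, cluster_b, poisoned])

# Trimming removes the poisoners but also the honest minority, biasing the
# update toward cluster A; down-weighting the poisoners preserves both clusters.
weights = [1.0] * 8 + [0.0] * 2
print("weighted mean:", weighted_mean(grads, weights))   # ~[1.75, 1.75, 1.75]
print("trimmed mean :", trimmed_mean(grads, k=2))        # ~[1.0, 1.0, 1.0]
```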
defense arXiv Dec 5, 2025 · Dec 2025

Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs

Igor Shilov, Alex Cloud, Aryo Pradipta Gema et al. · Anthropic Fellows Program · Imperial College London +3 more

Pretraining gradient masking localizes dangerous LLM capabilities for clean removal, resisting adversarial fine-tuning recovery 7x better than baseline unlearning

Prompt Injection nlp
3 citations · 1 influential · PDF Code
defense arXiv Nov 29, 2025 · Nov 2025

Teleportation-Based Defenses for Privacy in Approximate Machine Unlearning

Mohammad M Maheri, Xavier Cadet, Peter Chin et al. · Imperial College London · Dartmouth College

Proposes WARP teleportation defense that obfuscates unlearning signals, resisting membership inference and data reconstruction attacks

Membership Inference Attack Model Inversion Attack vision
PDF
benchmark arXiv Nov 13, 2025 · Nov 2025

CTRL-ALT-DECEIT: Sabotage Evaluations for Automated AI R&D

Francis Rhys Ward, Teun van der Weij, Hanna Gábor et al. · Apollo Research · Independent +2 more

Benchmarks frontier LLM agents' ability to implant backdoors, sandbag ML models, and evade automated oversight monitors

Model Poisoning Excessive Agency nlp
2 citations · 1 influential · PDF Code
attack arXiv Nov 13, 2025 · Nov 2025

eXIAA: eXplainable Injections for Adversarial Attack

Leonardo Pesce, Jiawen Wei, Gianmarco Mengaldo · National University of Singapore · Imperial College London

Black-box single-step adversarial attack that corrupts XAI saliency explanations on images while preserving predictions and remaining imperceptible

Input Manipulation Attack vision
PDF