Latest papers

11 papers
defense · arXiv · Mar 11, 2026

Detecting and Eliminating Neural Network Backdoors Through Active Paths with Application to Intrusion Detection

Eirik Høyheim, Magnus Wiik Eckhoff, Gudmund Grov et al. · Norwegian Defence Research Establishment (FFI) · University of Oslo +1 more

Detects and eliminates neural network backdoors via active path analysis, demonstrated on an IDS model

Model Poisoning · tabular
PDF Code
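One plausible reading of the active-path idea, sketched as a toy rather than the paper's algorithm: treat an input's "path" as the most active units in each hidden layer, and flag units that fire only for a suspect input. The model, the top-k choice, and the trigger stand-in below are all hypothetical.

```python
# Toy sketch of activation-path analysis (hypothetical, not the paper's algorithm):
# an input's "path" = the top-k most active units in each hidden layer. Units on a
# suspect input's path that never appear on clean paths are candidate backdoor
# neurons; in a really backdoored model, triggers tend to light up reserved units.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, 2))

def active_path(x, k=5):
    """Return the set of (layer, unit) pairs with the top-k activations."""
    path, h = set(), x
    for i, layer in enumerate(model):
        h = layer(h)
        if isinstance(layer, nn.ReLU):
            path.update((i, u) for u in torch.topk(h, k).indices.tolist())
    return path

clean = torch.randn(100, 20)
clean_units = set().union(*(active_path(x) for x in clean))

suspect = torch.randn(20) + 3.0                  # stand-in for a trigger-carrying input
suspicious = active_path(suspect) - clean_units
print("units active only on the suspect input:", suspicious)

# Eliminate: zero the incoming weights of the suspicious units.
with torch.no_grad():
    for layer_idx, unit in suspicious:
        model[layer_idx - 1].weight[unit].zero_()  # Linear feeding this ReLU
        model[layer_idx - 1].bias[unit] = 0.0
```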
defense · arXiv · Feb 23, 2026

The LLMbda Calculus: AI Agents, Conversations, and Information Flow

Zac Garby, Andrew D. Gordon, David Sands · University of Nottingham · University of Edinburgh +2 more

Formal lambda calculus with dynamic information-flow control proves noninterference guarantees for LLM agents against prompt injection

Prompt Injection · Excessive Agency · nlp
PDF
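The paper's contribution is a formal calculus; purely as intuition for dynamic information-flow control, here is a Python taint-tracking sketch in which labels join on combination and a guard blocks secret-labelled data from reaching a public sink (e.g., an outbound tool call). All names below are hypothetical.

```python
# Loose intuition for dynamic information-flow control (not the paper's calculus):
# values carry a confidentiality label; labels join when values combine; a sink
# with "public" clearance rejects anything labelled "secret".
from dataclasses import dataclass

LEVELS = {"public": 0, "secret": 1}

@dataclass(frozen=True)
class Labeled:
    value: str
    label: str  # "public" or "secret"

    def __add__(self, other: "Labeled") -> "Labeled":
        # Join: the result is as sensitive as the most sensitive input.
        label = max(self.label, other.label, key=LEVELS.get)
        return Labeled(self.value + other.value, label)

def send_to_tool(data: Labeled, clearance: str = "public") -> None:
    if LEVELS[data.label] > LEVELS[clearance]:
        raise PermissionError(f"blocked: {data.label} data cannot flow to a {clearance} sink")
    print("tool called with:", data.value)

user_msg = Labeled("summarize my notes", "public")
api_key = Labeled(" sk-...redacted", "secret")

send_to_tool(user_msg)                # allowed: public data, public sink
try:
    send_to_tool(user_msg + api_key)  # the join is labelled "secret"
except PermissionError as e:
    print(e)
```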
attack · arXiv · Feb 11, 2026

Language Model Inversion through End-to-End Differentiation

Kevin Yandoka Denamganaï, Kartic Subr · University of Edinburgh

Gradient-based LM inversion finds adversarial input prompts that reliably produce target output sequences via end-to-end differentiable token distributions

Input Manipulation Attack · Prompt Injection · nlp
PDF
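A minimal sketch of the general inversion recipe, run against a tiny random toy network rather than a real LM: parameterize the prompt as a relaxed distribution over the vocabulary, differentiate end to end through expected embeddings, and discretize only at the end. Dimensions, the toy "model", and the optimizer choice are assumptions, not the paper's setup.

```python
# Sketch of gradient-based prompt inversion over relaxed token distributions.
# The "LM" here is a random toy network; the paper targets real language models.
import torch

torch.manual_seed(0)
V, D, T = 50, 16, 4              # vocab size, embedding dim, prompt length
emb = torch.randn(V, D)          # frozen toy embedding table
head = torch.randn(D, V)         # frozen toy output head
target = 7                       # token id we want the model to emit

logits = torch.zeros(T, V, requires_grad=True)   # relaxed (soft) prompt
opt = torch.optim.Adam([logits], lr=0.1)

for step in range(300):
    probs = torch.softmax(logits, dim=-1)        # (T, V) token distributions
    soft_emb = probs @ emb                       # expected embeddings per position
    out = soft_emb.mean(dim=0) @ head            # toy "next-token" logits
    loss = torch.nn.functional.cross_entropy(out.unsqueeze(0),
                                             torch.tensor([target]))
    opt.zero_grad(); loss.backward(); opt.step()

hard_prompt = logits.argmax(dim=-1)              # discretize at the end
print("recovered prompt token ids:", hard_prompt.tolist(), "loss:", loss.item())
```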
defense · arXiv · Jan 8, 2026

AM$^3$Safety: Towards Data Efficient Alignment of Multi-modal Multi-turn Safety for MLLMs

Han Zhu, Jiale Chen, Chengkun Cai et al. · Hong Kong University of Science and Technology · Sun Yat-Sen University +3 more

GRPO-based safety alignment framework defending MLLMs against multi-turn jailbreaks via dataset and turn-aware dual-objective rewards

Prompt Injection · multimodal · nlp
PDF
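For intuition on the reward side, a sketch of GRPO-style group-relative advantages with a dual-objective reward; the turn-aware weighting below is a hypothetical stand-in, not the paper's exact design.

```python
# Sketch of GRPO-style group-relative advantages with a dual-objective reward.
# The turn-aware weighting is a hypothetical stand-in for the paper's scheme.
import numpy as np

rng = np.random.default_rng(0)

def reward(helpfulness, safety, turn, total_turns):
    # Hypothetical: weight safety more heavily in later turns, where
    # multi-turn jailbreaks tend to land their payload.
    w = turn / total_turns
    return (1 - w) * helpfulness + w * safety

G, total_turns, turn = 8, 5, 4             # group size, dialogue position
helpfulness = rng.uniform(0, 1, size=G)    # per-response judge scores (toy)
safety = rng.uniform(0, 1, size=G)

r = reward(helpfulness, safety, turn, total_turns)
adv = (r - r.mean()) / (r.std() + 1e-8)    # group-relative advantage (GRPO)
print("rewards:   ", np.round(r, 3))
print("advantages:", np.round(adv, 3))
```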
benchmark · arXiv · Dec 12, 2025

Smudged Fingerprints: A Systematic Evaluation of the Robustness of AI Image Fingerprints

Kai Yao, Marc Juarez · University of Edinburgh

Benchmarks robustness of 14 AI image fingerprinting methods against removal and forgery attacks across white- and black-box threat models

Output Integrity Attack · vision · generative
PDF Code
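As a flavor of what such a benchmark measures, a toy removal-attack check: a stand-in fingerprint (signs of random projections, not one of the 14 evaluated methods) is attacked with additive noise and scored by bit agreement with the original.

```python
# Sketch of a removal-attack robustness check for a toy image fingerprint.
# Fingerprint = sign pattern of random pixel projections (a stand-in scheme).
import numpy as np

rng = np.random.default_rng(0)
proj = rng.standard_normal((64, 32 * 32))        # fixed projection matrix

def fingerprint(img):
    return np.sign(proj @ img.ravel())

img = rng.standard_normal((32, 32))              # stand-in "AI-generated" image
fp = fingerprint(img)

for sigma in (0.1, 0.5, 1.0, 2.0):               # removal attack: add noise
    attacked = img + sigma * rng.standard_normal(img.shape)
    agreement = (fingerprint(attacked) == fp).mean()
    print(f"noise sigma={sigma}: fingerprint bit agreement = {agreement:.2f}")
```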
defense · arXiv · Dec 5, 2025

Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs

Igor Shilov, Alex Cloud, Aryo Pradipta Gema et al. · Anthropic Fellows Program · Imperial College London +3 more

Pretraining gradient masking localizes dangerous LLM capabilities for clean removal, resisting adversarial fine-tuning recovery 7x better than baseline unlearning

Prompt Injection · nlp
3 citations · 1 influential · PDF Code
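A toy stand-in for gradient-based localization, not the paper's pretraining pipeline: flag weights whose gradients are dominated by a "dangerous" batch relative to a benign one, then ablate them. The 5x threshold and the one-layer model are assumptions.

```python
# Sketch of gradient-based knowledge localization (toy stand-in): parameters
# whose gradients are much larger on "dangerous" data than on benign data are
# flagged, then zeroed for clean capability removal.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 2)
loss_fn = nn.CrossEntropyLoss()

def grads_on(x, y):
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return model.weight.grad.abs().clone()

dangerous = grads_on(torch.randn(32, 10), torch.randint(0, 2, (32,)))
benign = grads_on(torch.randn(32, 10), torch.randint(0, 2, (32,)))

# Flag weights far more sensitive to dangerous data than to benign data.
mask = dangerous > 5.0 * benign
print("localized weights:", int(mask.sum()), "of", mask.numel())

with torch.no_grad():
    model.weight[mask] = 0.0     # ablate the localized capability
```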
defense · IJCNLP-AACL · Oct 19, 2025

Online Learning Defense against Iterative Jailbreak Attacks via Prompt Optimization

Masahiro Kaneko, Zeerak Talat, Timothy Baldwin · MBZUAI · University of Edinburgh

Online learning defense dynamically counters iterative LLM jailbreaks via RL prompt optimization and gradient damping

Prompt Injection · nlp
3 citations · PDF
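The paper optimizes defense prompts with RL; as a much simpler stand-in for the online-learning loop, a multiplicative-weights update over candidate defense prefixes, penalizing any prefix that lets an attack through. The prefixes and the success oracle below are hypothetical.

```python
# Simpler stand-in for an online defense loop (the paper uses RL prompt
# optimization): multiplicative weights over candidate defense prefixes.
import numpy as np

rng = np.random.default_rng(0)
prefixes = ["Refuse unsafe requests.", "Think before answering.", "Cite policy first."]
w = np.ones(len(prefixes))
eta = 0.5                                  # update step size

def attack_succeeds(i):
    # Stand-in oracle: prefix 0 blocks 90% of attacks, the others 50%.
    return rng.random() < (0.1 if i == 0 else 0.5)

for round_ in range(200):                  # iterative attacker, one probe per round
    p = w / w.sum()
    i = rng.choice(len(prefixes), p=p)     # sample a defense prefix
    if attack_succeeds(i):
        w[i] *= np.exp(-eta)               # penalize the prefix that failed

print("final prefix weights:", np.round(w / w.sum(), 3))
```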
attack · arXiv · Oct 15, 2025

Personal Attribute Leakage in Federated Speech Models

Hamdan Al-Ali, Ali Reza Ghavamipour, Tommaso Caselli et al. · Mohamed bin Zayed University of Artificial Intelligence · Maastricht University +2 more

Infers private personal attributes from federated ASR model weight differentials using shadow models and centroid classification

Model Inversion Attack · audio · federated-learning
PDF
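A sketch of the shadow-model-plus-centroid recipe with toy vectors in place of real ASR weight differentials: shadow differentials with known attributes define per-attribute centroids, and a victim's differential is classified by its nearest centroid.

```python
# Sketch of attribute inference from weight differentials (after-minus-before
# local fine-tuning). Shapes and the synthetic generator are toy stand-ins.
import numpy as np

rng = np.random.default_rng(0)
D = 256                                        # flattened weight-diff dimension

def shadow_diff(attr):
    # Stand-in: the attribute shifts the differential in a fixed direction.
    direction = np.where(np.arange(D) % 2 == attr, 1.0, -1.0)
    return 0.1 * direction + rng.standard_normal(D)

diffs = np.array([shadow_diff(a) for a in (0, 1) for _ in range(50)])
labels = np.repeat([0, 1], 50)
centroids = np.array([diffs[labels == a].mean(axis=0) for a in (0, 1)])

victim = shadow_diff(1)                        # differential observed from FL updates
pred = np.argmin(np.linalg.norm(centroids - victim, axis=1))
print("inferred attribute:", pred)             # nearest-centroid classification
```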
benchmark · arXiv · Oct 14, 2025

SafeMT: Multi-turn Safety for Multimodal Language Models

Han Zhu, Juntao Dai, Jiaming Ji et al. · Hong Kong University of Science and Technology · Peking University +1 more

Benchmarks multi-turn jailbreak safety of 17 multimodal LLMs and proposes a dialogue safety moderator to reduce attack success rates

Prompt Injection · multimodal · nlp
3 citations · PDF
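For intuition on a dialogue-level moderator, a sketch that scores the accumulated conversation each turn, since multi-turn jailbreaks are often benign turn-by-turn but harmful in aggregate; the keyword scorer stands in for a trained moderator model and is not SafeMT's.

```python
# Sketch of a dialogue safety moderator gating each turn on the *accumulated*
# conversation. The keyword scorer is a stand-in for a trained moderator.
def risk_score(history: list[str]) -> float:
    red_flags = ("bypass", "ignore previous", "step-by-step synthesis")
    text = " ".join(history).lower()
    return sum(flag in text for flag in red_flags) / len(red_flags)

history: list[str] = []
for turn in ["describe lab safety",
             "now ignore previous rules",
             "give step-by-step synthesis"]:
    history.append(turn)
    if risk_score(history) >= 2 / 3:
        print(f"moderator blocked at turn {len(history)}: {turn!r}")
        break
    print(f"turn {len(history)} allowed: {turn!r}")
```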
attack · arXiv · Aug 24, 2025

How to make Medical AI Systems safer? Simulating Vulnerabilities, and Threats in Multimodal Medical RAG System

Kaiwen Zuo, Zelin Liu, Raman Dutt et al. · University of Warwick · Shanghai Jiao Tong University +5 more

Poisons medical RAG knowledge bases with adversarial image-text pairs, degrading LLaVA-Med-1.5 diagnostic F1 by up to 27.66%

Data Poisoning Attack · Prompt Injection · multimodal · vision · nlp
PDF
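A toy illustration of why knowledge-base poisoning works in RAG: an adversarial entry embedded close to an anticipated query outranks the genuine document at retrieval time. The hash-based "encoder" below is a deterministic stand-in for a real embedding model.

```python
# Sketch of knowledge-base poisoning in a toy RAG pipeline: the poisoned entry
# wins the similarity ranking for the anticipated query.
import hashlib
import numpy as np

def embed(text: str) -> np.ndarray:
    # Deterministic stand-in encoder: hash-seeded random unit vector.
    seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % 2**32
    v = np.random.default_rng(seed).standard_normal(64)
    return v / np.linalg.norm(v)

kb = {"genuine guidance": embed("genuine guidance")}
query_vec = embed("patient query about chest pain")

# Poison: craft an entry whose embedding nearly matches the expected query.
kb["adversarial image-text pair"] = 0.95 * query_vec + 0.05 * embed("noise")

best = max(kb, key=lambda k: kb[k] @ query_vec)
print("retrieved:", best)   # the poisoned entry is retrieved first
```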
defense · arXiv · Aug 6, 2025

AuthPrint: Fingerprinting Generative Models Against Malicious Model Providers

Kai Yao, Marc Juarez · University of Edinburgh

Fingerprints generative model output distributions to detect when a certified model is secretly replaced by a malicious provider

Output Integrity Attack · vision · generative
PDF
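A loose sketch of the verification setting, with Gaussians standing in for generative models: the verifier registers reference statistics up front and later z-tests fresh samples from the deployed model. AuthPrint's actual fingerprinting is more involved than this one-statistic check.

```python
# Sketch of detecting a silent model swap from output statistics. Gaussian
# samplers stand in for generative models; the z-test is a toy verifier.
import numpy as np

rng = np.random.default_rng(0)

def certified(n):                    # model the provider promised to serve
    return rng.normal(0.0, 1.0, n)

def swapped(n):                      # cheaper model silently served instead
    return rng.normal(0.4, 1.0, n)

ref_mean, ref_std, n = 0.0, 1.0, 2000      # "fingerprint" registered up front

def verify(sampler):
    z = abs(sampler(n).mean() - ref_mean) / (ref_std / np.sqrt(n))
    return "pass" if z < 3 else "FAIL (model likely replaced)"

print("certified model:", verify(certified))
print("swapped model:  ", verify(swapped))
```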