Philip Torr

attack arXiv Oct 2, 2025

ToolTweak: An Attack on Tool Selection in LLM-based Agents

Jonathan Sneh, Ruomei Yan, Jialin Yu et al. · University of Oxford · Microsoft

Adversarially crafts tool names and descriptions to bias LLM agents into selecting attacker-controlled tools over fair alternatives

Insecure Plugin Design Prompt Injection nlp

6 citations · 1 influential

attack arXiv Sep 25, 2025

FORCE: Transferable Visual Jailbreaking Attacks via Feature Over-Reliance CorrEction

Runqi Lin, Alasdair Paren, Suqin Yuan et al. · The University of Sydney · University of Oxford

Improves transferability of adversarial visual jailbreaks against closed-source MLLMs via loss landscape flattening and feature over-reliance correction

Input Manipulation Attack Prompt Injection vision multimodal nlp

6 citations

defense arXiv Oct 16, 2025

A Guardrail for Safety Preservation: When Safety-Sensitive Subspace Meets Harmful-Resistant Null-Space

Bingjie Zhang, Yibo Yang, Zhe Ren et al. · Jilin University · King Abdullah University of Science and Technology +1 more

Defends LLM safety alignment during fine-tuning by freezing safety-relevant weight subspaces and projecting adapter updates into a harmful-resistant null space

Transfer Learning Attack Prompt Injection nlp

3 citations

defense arXiv Dec 10, 2025

Unforgotten Safety: Preserving Safety Alignment of Large Language Models with Continual Learning

Lama Alssum, Hani Itani, Hasan Abed Al Kader Hammoud et al. · King Abdullah University of Science and Technology · University of Oxford

Continual learning methods preserve LLM safety alignment during fine-tuning, outperforming existing defenses on both benign and poisoned data

Transfer Learning Attack Prompt Injection nlp

2 citations

attack arXiv Feb 13, 2026

OMNI-LEAK: Orchestrator Multi-Agent Network Induced Data Leakage

Akshat Naik, Jay J Culligan, Yarin Gal et al. · University of Oxford · Toyota Motor Europe

Indirect prompt injection attack exfiltrates sensitive data across multi-agent LLM orchestrators, bypassing data access controls with a single injected payload

Prompt Injection Sensitive Information Disclosure nlp

attack arXiv Jan 30, 2026

The Alignment Curse: Cross-Modality Jailbreak Transfer in Omni-Models

Yupeng Chen, Junchi Yu, Aoxi Liu et al. · University of Oxford · The Chinese University of Hong Kong

Transfers text jailbreaks to audio via modality alignment in omni-models, outperforming native audio jailbreaks as a new red-teaming baseline

Prompt Injection audio nlp multimodal

attack arXiv Jan 30, 2026

A Fragile Guardrail: Diffusion LLM's Safety Blessing and Its Failure Mode

Zeyuan He, Yupeng Chen, Lang Lin et al. · University of Oxford · The Chinese University of Hong Kong +2 more

Discovers the intrinsic jailbreak resistance of diffusion LLMs (D-LLMs), then breaks it with context-nesting prompts that achieve state-of-the-art attack rates

Prompt Injection nlp

benchmark arXiv Dec 29, 2025

It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents

Karolina Korgul, Yushi Yang, Arkadiusz Drohomirecki et al. · University of Oxford · SoftServe +2 more

Benchmarks indirect prompt injection susceptibility of six frontier LLM agents on realistic web tasks using persuasion techniques

Prompt Injection Excessive Agency nlp

Papers in Database (8)