Latest papers

46 papers
defense arXiv Feb 26, 2026 · 5w ago

A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring

Usman Anwar, Julianna Piskorz, David D. Baek et al. · University of Cambridge · Massachusetts Institute of Technology +4 more

Formalizes and detects steganographic reasoning in LLMs that allows misaligned models to evade AI oversight via covert output signals

Output Integrity Attack Excessive Agency nlp
PDF
attack arXiv Feb 16, 2026 · 7w ago

Boundary Point Jailbreaking of Black-Box LLMs

Xander Davies, Giorgi Giglemiani, Edmund Lau et al. · UK AI Security Institute · University of Oxford

Fully black-box automated jailbreak using binary classifier feedback and curriculum learning defeats Anthropic and GPT-5 safety classifiers

Prompt Injection nlp
PDF
attack arXiv Feb 15, 2026 · 7w ago

SkillJect: Automating Stealthy Skill-Based Prompt Injection for Coding Agents with Trace-Driven Closed-Loop Refinement

Xiaojun Jia, Jie Liao, Simeng Qin et al. · Nanyang Technological University · Chongqing University +4 more

Automated framework crafts stealthy skill-based prompt injections against LLM coding agents via trace-driven closed-loop refinement

Prompt Injection Insecure Plugin Design nlp
PDF
attack arXiv Feb 13, 2026 · 7w ago

OMNI-LEAK: Orchestrator Multi-Agent Network Induced Data Leakage

Akshat Naik, Jay J Culligan, Yarin Gal et al. · University of Oxford · Toyota Motor Europe

Indirect prompt injection attack exfiltrates sensitive data across multi-agent LLM orchestrators, bypassing data access controls with a single injected payload

Prompt Injection Sensitive Information Disclosure nlp
PDF
attack arXiv Feb 10, 2026 · 7w ago

Infusion: Shaping Model Behavior by Editing Training Data via Influence Functions

J Rosser, Robert Kirk, Edward Grefenstette et al. · University of Oxford · Independent +2 more

Poisons ML models by perturbing existing training data via influence functions, inducing targeted behavior without injecting explicit attack examples

Data Poisoning Attack Training Data Poisoning vision nlp
PDF Code
attack arXiv Feb 4, 2026 · 8w ago

Attack Selection Reduces Safety in Concentrated AI Control Settings against Trusted Monitoring

Joachim Schaeffer, Arjun Khandelwal, Tyler Tracy · Pivotal Research · University of Oxford +1 more

LLMs that reason about monitors while selecting attacks reduce AI control safety from 99% to 59%, exposing blind spots in optimistic safety evaluations

Excessive Agency nlp
PDF Code
defense arXiv Feb 2, 2026 · 9w ago

Provable Defense Framework for LLM Jailbreaks via Noise-Augmented Alignment

Zehua Cheng, Jianwei Yang, Wei Dai et al. · University of Oxford · FLock.io +1 more

Proposes certifiably robust LLM jailbreak defense via randomized ablation smoothing, cutting GCG attack success from 84% to 1%

Input Manipulation Attack Prompt Injection nlp
PDF
defense arXiv Jan 31, 2026 · 9w ago

Safety-Efficacy Trade Off: Robustness against Data-Poisoning

Diego Granziol · University of Oxford

Proves dirty-label backdoor attacks can be spectrally invisible; proposes an input-gradient regularization defense with an unavoidable safety-efficacy trade-off

Model Poisoning Data Poisoning Attack vision
PDF
attack arXiv Jan 30, 2026 · 9w ago

The Alignment Curse: Cross-Modality Jailbreak Transfer in Omni-Models

Yupeng Chen, Junchi Yu, Aoxi Liu et al. · University of Oxford · The Chinese University of Hong Kong

Transfers text jailbreaks to audio via modality alignment in omni-models, outperforming native audio jailbreaks as a new red-teaming baseline

Prompt Injection audio nlp multimodal
PDF
attack arXiv Jan 30, 2026 · 9w ago

Hide and Seek in Embedding Space: Geometry-based Steganography and Detection in Large Language Models

Charles Westphal, Keivan Navaie, Fernando E. Rosas · University College London · ML Alignment Theory Scholars +4 more

Maliciously LoRA-fine-tuned LLMs covertly exfiltrate prompt secrets via geometry-based steganography, detectable with linear probes on internal activations

Model Poisoning Sensitive Information Disclosure nlp
PDF
attack arXiv Jan 30, 2026 · 9w ago

A Fragile Guardrail: Diffusion LLM's Safety Blessing and Its Failure Mode

Zeyuan He, Yupeng Chen, Lang Lin et al. · University of Oxford · The Chinese University of Hong Kong +2 more

Discovers diffusion LLMs' intrinsic jailbreak resistance, then breaks it with context-nesting prompts that achieve SOTA attack success rates

Prompt Injection nlp
PDF
attack arXiv Jan 23, 2026 · 10w ago

Theory of Minimal Weight Perturbations in Deep Networks and its Applications for Low-Rank Activated Backdoor Attacks

Bethan Evans, Jared Tanner · University of Oxford

Derives minimal weight-perturbation bounds for DNNs and shows low-rank compression reliably activates latent backdoors

Model Poisoning vision
PDF
benchmark arXiv Jan 21, 2026 · 10w ago

Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models

Anmol Goel, Cornelius Emde, Sangdoo Yun et al. · Parameter Lab · TU Darmstadt +3 more

Benign fine-tuning silently breaks contextual privacy in LLMs, causing inappropriate data disclosures that standard safety benchmarks fail to detect

Transfer Learning Attack Sensitive Information Disclosure nlp
PDF
benchmark arXiv Jan 19, 2026 · 11w ago

Objective Matters: Fine-Tuning Objectives Shape Safety, Robustness, and Persona Drift

Daniel Vennemeyer, Punya Syon Pandey, Phan Anh Duong et al. · University of Cincinnati · University of Toronto +1 more

Compares six LLM fine-tuning objectives and finds ORPO and KL-regularization best preserve jailbreak resistance and alignment at scale

Transfer Learning Attack Prompt Injection nlp
PDF
defense arXiv Jan 15, 2026 · 11w ago

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

Christina Lu, Jack Gallagher, Jonathan Michala et al. · MATS · Anthropic Fellows Program +2 more

Discovers an 'Assistant Axis' in LLM activations and uses activation capping to block persona-based jailbreaks and harmful drift

Prompt Injection nlp
10 citations 1 influential PDF
benchmark arXiv Dec 29, 2025 · Dec 2025

It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents

Karolina Korgul, Yushi Yang, Arkadiusz Drohomirecki et al. · University of Oxford · SoftServe +2 more

Benchmarks indirect prompt injection susceptibility of six frontier LLM agents on realistic web tasks using persuasion techniques

Prompt Injection Excessive Agency nlp
PDF
benchmark arXiv Dec 15, 2025 · Dec 2025

Video Reality Test: Can AI-Generated ASMR Videos Fool VLMs and Humans?

Jiaqi Wang, Weijia Wu, Yi Zhan et al. · CUHK · NUS +2 more

Benchmark revealing VLMs barely exceed chance at detecting AI-generated ASMR videos, far below human expert accuracy

Output Integrity Attack vision audio multimodal
1 citation PDF Code
defense arXiv Dec 10, 2025 · Dec 2025

Unforgotten Safety: Preserving Safety Alignment of Large Language Models with Continual Learning

Lama Alssum, Hani Itani, Hasan Abed Al Kader Hammoud et al. · King Abdullah University of Science and Technology · University of Oxford

Continual learning methods preserve LLM safety alignment during fine-tuning, outperforming existing defenses on both benign and poisoned data

Transfer Learning Attack Prompt Injection nlp
2 citations PDF
defense arXiv Dec 8, 2025 · Dec 2025

Towards Robust Protective Perturbation against DeepFake Face Swapping

Hengyang Yao, Lin Li, Ke Sun et al. · University of Birmingham · University of Oxford +2 more

Defends faces against deepfake swapping using RL-learned robust adversarial perturbations, outperforming EOT baselines by 26%

Output Integrity Attack vision generative
PDF
benchmark arXiv Nov 24, 2025 · Nov 2025

Open-weight genome language model safeguards: Assessing robustness via adversarial fine-tuning

James R. M. Black, Moritz S. Hanke, Aaron Maiwald et al. · Johns Hopkins Bloomberg School of Public Health · University of Oxford +1 more

Adversarial fine-tuning on viral sequences bypasses data-exclusion safety filtering in an open-weight genomic language model, restoring restricted capabilities

Transfer Learning Attack nlp generative
3 citations PDF