Latest papers

12 papers
defense arXiv Feb 26, 2026 · 5w ago

A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring

Usman Anwar, Julianna Piskorz, David D. Baek et al. · University of Cambridge · Massachusetts Institute of Technology +4 more

Formalizes and detects steganographic reasoning in LLMs that allows misaligned models to evade AI oversight via covert output signals

Output Integrity Attack Excessive Agency nlp
PDF
defense arXiv Feb 12, 2026 · 7w ago

ANML: Attribution-Native Machine Learning with Guaranteed Robustness

Oliver Zahn, Matt Beton, Simran Chana · arXiv · Independent Researcher +1 more

Defends against data poisoning via contributor-reputation-weighted training, outperforming Byzantine-robust baselines under joint credential-faking and gradient-alignment attacks

Data Poisoning Attack tabularfederated-learning
PDF
defense arXiv Jan 30, 2026 · 9w ago

No More, No Less: Least-Privilege Language Models

Paulius Rauba, Dominykas Seputis, Patrikas Vanagas et al. · University of Cambridge · Vinted +2 more

Proposes inference-time capability restriction for LLMs by controlling reachable internal computation via rank-indexed weight interventions

Prompt Injection nlp
PDF
attack arXiv Jan 27, 2026 · 9w ago

Thought-Transfer: Indirect Targeted Poisoning Attacks on Chain-of-Thought Reasoning Models

Harsh Chaudhari, Ethan Rathbun, Hanna Foerster et al. · Northeastern University · University of Cambridge +4 more

Poisons LLM CoT training data by corrupting reasoning traces to inject targeted behaviors into unseen domains without altering queries or answers

Data Poisoning Attack Training Data Poisoning nlp
PDF
defense arXiv Jan 14, 2026 · 11w ago

CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents

Hanna Foerster, Tom Blanchard, Kristina Nikolić et al. · University of Cambridge · University of Toronto +3 more

Defends computer-use AI agents against prompt injection via pre-computed execution graphs, revealing Branch Steering as a residual threat

Prompt Injection Excessive Agency nlpmultimodal
1 citations PDF
defense arXiv Dec 17, 2025 · Dec 2025

Remotely Detectable Robot Policy Watermarking

Michael Amir, Manon Flageat, Amanda Prorok · University of Cambridge

Watermarks robot RL policies with spectral motion signals detectable remotely via video for IP ownership verification

Model Theft reinforcement-learningvision
PDF
benchmark arXiv Nov 13, 2025 · Nov 2025

Speech-Audio Compositional Attacks on Multimodal LLMs and Their Mitigation with SALMONN-Guard

Yudong Yang, Xuezhen Zhang, Zhifeng Han et al. · Tsinghua University · Shanghai Artificial Intelligence Laboratory +1 more

Black-box audio jailbreaks via speech composition bypass multimodal LLM guardrails; SALMONN-Guard cuts attack success from 66% to 20%

Prompt Injection audiomultimodalnlp
3 citations PDF Code
attack arXiv Nov 10, 2025 · Nov 2025

Graph Representation-based Model Poisoning on the Heterogeneous Internet of Agents

Hanlin Cai, Houtianfu Wang, Haofan Dong et al. · University of Cambridge · CISTER Research Centre +2 more

Graph autoencoder-based Byzantine attack on federated LLM fine-tuning that evades cosine/distance-based defenses by mimicking benign update statistics

Data Poisoning Attack federated-learningnlp
1 citations PDF
attack arXiv Sep 6, 2025 · Sep 2025

Reasoning Introduces New Poisoning Attacks Yet Makes Them More Complicated

Hanna Foerster, Ilia Shumailov, Yiren Zhao et al. · University of Cambridge · Google DeepMind +3 more

Proposes split-trigger backdoors that corrupt LLM reasoning paths, but finds reasoning models exhibit emergent robustness against final-answer manipulation

Model Poisoning Training Data Poisoning nlp
PDF
attack arXiv Aug 24, 2025 · Aug 2025

How to make Medical AI Systems safer? Simulating Vulnerabilities, and Threats in Multimodal Medical RAG System

Kaiwen Zuo, Zelin Liu, Raman Dutt et al. · University of Warwick · Shanghai Jiao Tong University +5 more

Poisons medical RAG knowledge bases with adversarial image-text pairs to degrade LLaVA-Med-1.5 diagnostic outputs by up to 27.66% F1

Data Poisoning Attack Prompt Injection multimodalvisionnlp
PDF
defense arXiv Aug 22, 2025 · Aug 2025

Retrieval-Augmented Defense: Adaptive and Controllable Jailbreak Prevention for Large Language Models

Guangyu Yang, Jinghong Chen, Jingbiao Mei et al. · University of Cambridge

RAG-based jailbreak defense for LLMs that retrieves known attack examples to detect and block prompt injection attempts without retraining

Prompt Injection nlp
PDF Code
benchmark arXiv Aug 1, 2025 · Aug 2025

DACTYL: Diverse Adversarial Corpus of Texts Yielded from Large Language Models

Shantanu Thorat, Andrew Caines · University of Cambridge

Introduces adversarial AI-text detection benchmark where existing detectors fail on one-shot and CPT-generated LLM outputs

Output Integrity Attack nlp
PDF