Latest papers

7 papers
attack arXiv Feb 16, 2026 · 7w ago

Boundary Point Jailbreaking of Black-Box LLMs

Xander Davies, Giorgi Giglemiani, Edmund Lau et al. · UK AI Security Institute · University of Oxford

Fully black-box automated jailbreak using binary classifier feedback and curriculum learning defeats Anthropic and GPT-5 safety classifiers

Prompt Injection nlp
PDF
defense arXiv Dec 15, 2025 · Dec 2025

Async Control: Stress-testing Asynchronous Control Measures for LLM Agents

Asa Cooper Stickland, Jan Michelfeit, Arathi Mani et al. · UK AI Security Institute

Stress-tests asynchronous monitors for misaligned LLM coding agents via iterative red-blue team games in realistic SWE environments

Excessive Agency Prompt Injection nlp
1 citations PDF Code
benchmark arXiv Oct 26, 2025 · Oct 2025

Breaking Agent Backbones: Evaluating the Security of Backbone LLMs in AI Agents

Julia Bazinska, Max Mathys, Francesco Casucci et al. · Lakera AI · ETH Zürich +2 more

Benchmarks 34 backbone LLMs against 194K crowdsourced adversarial attacks using a threat-snapshot framework for AI agent security

Prompt Injection Excessive Agency nlp
1 citations PDF
attack arXiv Oct 8, 2025 · Oct 2025

Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples

Alexandra Souly, Javier Rando, Ed Chapman et al. · UK AI Security Institute · Anthropic +3 more

Shows LLM backdoor poisoning needs only ~250 documents regardless of model size, making attacks more practical at scale

Model Poisoning Data Poisoning Attack Training Data Poisoning nlp
32 citations 2 influentialPDF
defense arXiv Oct 5, 2025 · Oct 2025

Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time

Daniel Tan, Anders Woodruff, Niels Warncke et al. · University College London · Center on Long-Term Risk +2 more

Proposes inoculation prompting, a training-time technique that suppresses backdoors and emergent misalignment in fine-tuned LLMs at test time

Model Poisoning Prompt Injection nlp
8 citations PDF
tool arXiv Oct 2, 2025 · Oct 2025

RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents

Chengquan Guo, Chulin Xie, Yu Yang et al. · University of Chicago · University of Illinois Urbana-Champaign +5 more

Automated red-teaming agent that adaptively combines jailbreak tools to uncover safety vulnerabilities in LLM-based code agents

Prompt Injection nlp
4 citations PDF
defense arXiv Aug 8, 2025 · Aug 2025

Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs

Kyle O'Brien, Stephen Casper, Quentin Anthony et al. · EleutherAI · UK AI Security Institute +1 more

Defends open-weight LLMs against adversarial fine-tuning by filtering biothreat data from pretraining, resisting 10K fine-tuning steps

Transfer Learning Attack nlp
PDF