ML Security Papers

Latest papers

7 papers

attack arXiv Feb 16, 2026 · 7w ago

Boundary Point Jailbreaking of Black-Box LLMs

Xander Davies, Giorgi Giglemiani, Edmund Lau et al. · UK AI Security Institute · University of Oxford

Fully black-box automated jailbreak using binary classifier feedback and curriculum learning defeats Anthropic and GPT-5 safety classifiers

Prompt Injection nlp

PDF

defense arXiv Dec 15, 2025 · Dec 2025

Async Control: Stress-testing Asynchronous Control Measures for LLM Agents

Asa Cooper Stickland, Jan Michelfeit, Arathi Mani et al. · UK AI Security Institute

Stress-tests asynchronous monitors for misaligned LLM coding agents via iterative red-blue team games in realistic SWE environments

Excessive Agency Prompt Injection nlp

1 citations PDF Code

benchmark arXiv Oct 26, 2025 · Oct 2025

Breaking Agent Backbones: Evaluating the Security of Backbone LLMs in AI Agents

Julia Bazinska, Max Mathys, Francesco Casucci et al. · Lakera AI · ETH Zürich +2 more

Benchmarks 34 backbone LLMs against 194K crowdsourced adversarial attacks using a threat-snapshot framework for AI agent security

Prompt Injection Excessive Agency nlp

1 citations PDF

attack arXiv Oct 8, 2025 · Oct 2025

Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples

Alexandra Souly, Javier Rando, Ed Chapman et al. · UK AI Security Institute · Anthropic +3 more

Shows LLM backdoor poisoning needs only ~250 documents regardless of model size, making attacks more practical at scale

Model Poisoning Data Poisoning Attack Training Data Poisoning nlp

32 citations 2 influentialPDF

defense arXiv Oct 5, 2025 · Oct 2025

Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time

Daniel Tan, Anders Woodruff, Niels Warncke et al. · University College London · Center on Long-Term Risk +2 more

Proposes inoculation prompting, a training-time technique that suppresses backdoors and emergent misalignment in fine-tuned LLMs at test time

Model Poisoning Prompt Injection nlp

8 citations PDF

tool arXiv Oct 2, 2025 · Oct 2025

RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents

Chengquan Guo, Chulin Xie, Yu Yang et al. · University of Chicago · University of Illinois Urbana-Champaign +5 more

Automated red-teaming agent that adaptively combines jailbreak tools to uncover safety vulnerabilities in LLM-based code agents

Prompt Injection nlp

4 citations PDF

defense arXiv Aug 8, 2025 · Aug 2025

Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs

Kyle O'Brien, Stephen Casper, Quentin Anthony et al. · EleutherAI · UK AI Security Institute +1 more

Defends open-weight LLMs against adversarial fine-tuning by filtering biothreat data from pretraining, resisting 10K fine-tuning steps

Transfer Learning Attack nlp

PDF

Latest papers

Boundary Point Jailbreaking of Black-Box LLMs

Async Control: Stress-testing Asynchronous Control Measures for LLM Agents

Breaking Agent Backbones: Evaluating the Security of Backbone LLMs in AI Agents

Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples

Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time

RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents

Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs

Filters

Time Period

Paper Type

OWASP ML Top 10

OWASP LLM Top 10

Institution

Venue