Geoffrey Irving

attack arXiv Feb 16, 2026

Xander Davies, Giorgi Giglemiani, Edmund Lau et al. · UK AI Security Institute · University of Oxford

Fully black-box automated jailbreak that uses binary classifier feedback and curriculum learning to defeat Anthropic and GPT-5 safety classifiers

Prompt Injection nlp

Papers in Database (1)