Masahiro Kaneko

defense IJCNLP-AACL Oct 19, 2025 · Oct 2025

Online Learning Defense against Iterative Jailbreak Attacks via Prompt Optimization

Masahiro Kaneko, Zeerak Talat, Timothy Baldwin · MBZUAI · University of Edinburgh

Online learning defense dynamically counters iterative LLM jailbreaks via RL-based prompt optimization and gradient damping

Prompt Injection nlp

3 citations

benchmark arXiv Oct 19, 2025 · Oct 2025

Bits Leaked per Query: Information-Theoretic Bounds on Adversarial Attacks against LLMs

Masahiro Kaneko, Timothy Baldwin · MBZUAI

Information-theoretic framework bounds LLM adversarial query complexity as log(1/ε)/I(Z;T), quantifying exact security cost of exposing logits or chain-of-thought

Prompt Injection Sensitive Information Disclosure nlp
