Roman Belaire

Papers in Database (1)

attack · arXiv · Aug 6, 2025

Automatic LLM Red Teaming

Roman Belaire, Arunesh Sinha, Pradeep Varakantham · Singapore Management University · Rutgers University

Trains an RL agent to conduct multi-turn jailbreak attacks on LLMs by formalizing red teaming as a hierarchical MDP

Prompt Injection · nlp · reinforcement-learning