Ming Jin

h-index: 0 0 citations 1 papers (total)

Papers in Database (1)

defense arXiv Feb 9, 2026 · 8w ago

Reinforcement Learning with Backtracking Feedback

Bilgehan Sel, Vaishakh Keshava, Phillip Wallis et al. · Google · Virginia Tech +1 more

Trains LLMs to self-correct safety violations mid-generation via RL and a 'backtrack by x tokens' signal, reducing GCG and jailbreak attack success rates

Input Manipulation Attack Prompt Injection nlp
PDF