Kartik Pandit

h-index: 1 · 5 citations · 3 papers (total)

Papers in Database (1)

defense · arXiv · Oct 3, 2025

Certifiable Safe RLHF: Fixed-Penalty Constraint Optimization for Safer Language Models

Kartik Pandit, Sourav Ganguly, Arnesh Banerjee et al. · New Jersey Institute of Technology · Heritage Institute of Technology

Proposes CS-RLHF, a penalty-based constrained RLHF framework offering certifiable safety and 5× the jailbreak resistance of Lagrangian baselines.
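To illustrate the idea behind a fixed-penalty constrained objective, here is a minimal, hedged sketch (not the paper's actual implementation): unlike a Lagrangian method, which adapts the multiplier during training, the penalty coefficient `lam` stays fixed, so the objective is stationary. All names (`fixed_penalty_objective`, `lam`, `cost_limit`) are illustrative assumptions.

```python
def fixed_penalty_objective(reward: float, cost: float,
                            cost_limit: float, lam: float) -> float:
    """Reward minus a fixed penalty on constraint violation.

    reward:     task reward for a response (higher is better)
    cost:       safety cost for the response (higher is less safe)
    cost_limit: allowed safety-cost budget
    lam:        fixed penalty coefficient (not updated during training)
    """
    violation = max(0.0, cost - cost_limit)  # penalize only when over budget
    return reward - lam * violation

# A response within the cost budget keeps its full reward;
# one over budget is penalized in proportion to the violation.
safe = fixed_penalty_objective(reward=1.0, cost=0.2, cost_limit=0.5, lam=10.0)
unsafe = fixed_penalty_objective(reward=1.0, cost=0.9, cost_limit=0.5, lam=10.0)
print(safe, unsafe)  # → 1.0 -3.0
```

Keeping the penalty fixed avoids the multiplier oscillation that can make Lagrangian training unstable, which is one plausible reading of the "certifiable" framing in the abstract.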

Prompt Injection · nlp · reinforcement-learning