Latest papers

2 papers
defense arXiv Feb 24, 2026 · 5w ago

Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment

Mengxuan Hu, Vivek V. Datla, Anoop Kumar et al. · University of Virginia · Capital One

Defends LLMs against jailbreaks by training reasoning-aware refusals via CoT datasets and segment-weighted DPO

Prompt Injection nlp
PDF
defense arXiv Nov 2, 2025 · Nov 2025

EraseFlow: Learning Concept Erasure Policies via GFlowNet-Driven Alignment

Abhiram Kusumba, Maitreya Patel, Kyle Min et al. · Capital One · Arizona State University +2 more

GFlowNet-based concept erasure for diffusion models, robust to adversarial bypass attacks, without requiring crafted reward models

Output Integrity Attack Input Manipulation Attack visiongenerative
1 citations 1 influentialPDF