Latest papers

3 papers
attack arXiv Apr 22, 2026

Adaptive Instruction Composition for Automated LLM Red-Teaming

Jesse Zymet, Andy Luo, Swapnil Shinde et al. · Capital One

RL-based red-teaming framework that adaptively composes crowdsourced jailbreak tactics to discover diverse, effective attacks against LLMs

Prompt Injection nlp
defense arXiv Feb 24, 2026

Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment

Mengxuan Hu, Vivek V. Datla, Anoop Kumar et al. · University of Virginia · Capital One

Defends LLMs against jailbreaks by training reasoning-aware refusals via CoT datasets and segment-weighted DPO

Prompt Injection nlp
defense arXiv Nov 2, 2025

EraseFlow: Learning Concept Erasure Policies via GFlowNet-Driven Alignment

Abhiram Kusumba, Maitreya Patel, Kyle Min et al. · Capital One · Arizona State University +2 more

GFlowNet-based concept erasure for diffusion models, robust to adversarial bypass attacks, without requiring crafted reward models

Output Integrity Attack · Input Manipulation Attack · vision · generative
1 citation · 1 influential