Latest papers

3 papers
defense · arXiv · Mar 10, 2026

GR-SAP: Generative Replay for Safety Alignment Preservation during Fine-Tuning

Zhouxiang Fang, Jiawei Zhou, Hanjie Chen · Rice University · Stony Brook University

Defends LLM safety alignment against fine-tuning-induced degradation using generative replay of synthesized safety data

Transfer Learning Attack · Prompt Injection · nlp
defense · arXiv · Oct 13, 2025

Quantifying Information Disclosure During Gradient Descent Using Gradient Uniqueness

Sleem Abdelghafar, Maryam Aliakbarpour, Chris Jermaine · Rice University

Proposes the Gradient Uniqueness metric to audit per-datapoint training-data disclosure risk in LLMs, predicting sequence extractability with a scalable in-run algorithm

Model Inversion Attack · Sensitive Information Disclosure · nlp
tool · arXiv · Aug 18, 2025

Unlearning Comparator: A Visual Analytics System for Comparative Evaluation of Machine Unlearning Methods

Jaeung Lee, Suhyeon Yu, Yurim Jang et al. · Sungkyunkwan University · Rice University

Visual analytics tool for comparing machine unlearning methods, with integrated membership inference attack simulation to assess privacy

Membership Inference Attack · vision