Sai Praneeth Karimireddy

attack arXiv Jan 29, 2026

Hair-Trigger Alignment: Black-Box Evaluation Cannot Guarantee Post-Update Alignment

Yavuz Bakman, Duygu Nur Yaldiz, Salman Avestimehr et al. · University of Southern California

Proves that static black-box alignment evaluation guarantees nothing about post-update alignment; constructs LLMs that hide latent jailbreak misalignment triggered by a single benign gradient step

Model Poisoning Prompt Injection nlp

1 citation · PDF

benchmark arXiv Sep 22, 2025

VoxGuard: Evaluating User and Attribute Privacy in Speech via Membership Inference Attacks

Efthymios Tsaprazlis, Thanathai Lertpetchpun, Tiantian Feng et al. · University of Southern California

VoxGuard benchmarks voice anonymization privacy via membership inference at low false-positive rates (FPR), showing that equal error rate (EER) massively underestimates adversarial leakage

Membership Inference Attack audio

PDF
