Isotta Magistrali

Papers in Database (1)

attack · arXiv · Mar 1, 2026

Subliminal Signals in Preference Labels

Isotta Magistrali, Frédéric Berdoz, Sam Dauncey et al. · ETH Zürich

Biased LLM judge covertly encodes behavioral traits into student models via binary RLHF preference labels, bypassing semantic oversight

Transfer Learning Attack · Data Poisoning Attack · Training Data Poisoning · nlp