Stella Biderman

Papers in Database (1)

defense arXiv Aug 8, 2025 · Aug 2025

Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs

Kyle O'Brien, Stephen Casper, Quentin Anthony et al. · EleutherAI · UK AI Security Institute +1 more

Defends open-weight LLMs against adversarial fine-tuning by filtering biothreat data from pretraining, resisting 10K fine-tuning steps

Transfer Learning Attack nlp
PDF