Jorio Cocola

h-index: 5 · 44 citations · 11 papers (total)

Papers in Database (1)

attack · arXiv · Dec 10, 2025

Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs

Jan Betley, Jorio Cocola, Dylan Feng et al. · Truthful AI · MATS Fellowship and 3 more

Demonstrates inductive backdoors and persona-poisoning attacks that corrupt LLMs through narrow fine-tuning generalization.

Tags: Model Poisoning · Data Poisoning Attack · Training Data Poisoning · nlp
10 citations