Latest papers

1 paper
benchmark · arXiv · Dec 25, 2025

Learning from Negative Examples: Why Warning-Framed Training Data Teaches What It Warns Against

Tsogt-Ochir Enkhbayar · Mongolian Artificial Intelligence Society

Shows that warning-framed LLM training data still teaches the warned-against behaviors; sparse autoencoder (SAE) analysis indicates that safety framing fails to separate the underlying latent features.

Data Poisoning Attack · Training Data Poisoning · NLP