
Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

Hadas Orgad 1, Boyi Wei 1,2, Kaden Zheng 1,2, Martin Wattenberg 1, Peter Henderson 3, Seraphina Goldfarb-Tarrant 1, Yonatan Belinkov 4,1


Published on arXiv

2604.09544

Transfer Learning Attack

OWASP ML Top 10 — ML07

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Harmful content generation compresses into approximately 0.0005% of model parameters; pruning these weights substantially reduces emergent misalignment while preserving benign capabilities and harm detection ability

Targeted weight pruning for harm mechanism isolation

Novel technique introduced


Large language models (LLMs) undergo alignment training to avoid harmful behaviors, yet the resulting safeguards remain brittle: jailbreaks routinely bypass them, and fine-tuning on narrow domains can induce "emergent misalignment" that generalizes broadly. Whether this brittleness reflects a fundamental lack of coherent internal organization for harmfulness remains unclear. Here we use targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs. We find that harmful content generation depends on a compact set of weights that are general across harm types and distinct from benign capabilities. Aligned models exhibit greater compression of harm generation weights than unaligned counterparts, indicating that alignment reshapes harmful representations internally, despite the brittleness of safety guardrails at the surface level. This compression explains emergent misalignment: if the weights for harmful capabilities are compressed, fine-tuning that engages these weights in one domain can trigger broad misalignment. Consistent with this, pruning harm generation weights in a narrow domain substantially reduces emergent misalignment. Notably, LLMs' harmful generation capability is dissociated from how they recognize and explain such content. Together, these results reveal a coherent internal structure for harmfulness in LLMs that may serve as a foundation for more principled approaches to safety.


Key Contributions

  • Identifies that harmful content generation depends on ~0.0005% of model parameters that are general across harm types and distinct from benign capabilities
  • Shows aligned models exhibit greater compression of harm generation weights than unaligned models, explaining emergent misalignment
  • Demonstrates that pruning harm-generation weights reduces emergent misalignment even when pruning data comes from different harm domains than fine-tuning data
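The pruning intervention described above can be sketched in code. This is an illustrative approximation, not the paper's exact method: it assumes per-weight "harm importance" can be scored by a first-order attribution, |weight × gradient| on harmful examples minus the same score on benign examples, and then zeroes only the top fraction of weights by that differential score (the paper reports roughly 0.0005% of parameters). The model, data, and scoring function here are toy stand-ins.

```python
# Hypothetical sketch of targeted weight pruning for harm-mechanism
# isolation. The importance score below (|w * dL/dw|) is a common
# first-order attribution heuristic, assumed here for illustration;
# the paper's actual scoring procedure may differ.
import torch
import torch.nn as nn

def importance(model, inputs, targets, loss_fn):
    """Per-weight first-order importance |w * dL/dw| on one batch."""
    model.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    return {name: (p * p.grad).abs().detach()
            for name, p in model.named_parameters() if p.grad is not None}

def prune_harm_weights(model, harm_imp, benign_imp, fraction):
    """Zero the top `fraction` of weights ranked by harm-specific
    importance (harm minus benign), sparing weights that matter for
    benign behavior."""
    scores = torch.cat([(harm_imp[n] - benign_imp[n]).flatten()
                        for n in harm_imp])
    k = max(1, int(fraction * scores.numel()))
    threshold = torch.topk(scores, k).values.min()
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in harm_imp:
                p[(harm_imp[name] - benign_imp[name]) >= threshold] = 0.0

# Toy demonstration on a tiny regression model with random stand-in
# batches for "harmful" and "benign" data.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
loss_fn = nn.MSELoss()
x_harm, y_harm = torch.randn(8, 16), torch.randn(8, 4)
x_ben, y_ben = torch.randn(8, 16), torch.randn(8, 4)

harm_imp = importance(model, x_harm, y_harm, loss_fn)
ben_imp = importance(model, x_ben, y_ben, loss_fn)
prune_harm_weights(model, harm_imp, ben_imp, fraction=0.001)

n_zero = sum((p == 0).sum().item() for p in model.parameters())
print(n_zero)  # a handful of weights zeroed; the model otherwise intact
```

In a real setting the importance scores would come from harmful-generation and benign batches run through the full LLM, and the fraction would be on the order of 5e-6 rather than the exaggerated 0.001 used here so the toy example visibly prunes something.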

🛡️ Threat Analysis

Transfer Learning Attack

The paper directly addresses emergent misalignment, where fine-tuning on narrow domains causes harmful behaviors to generalize broadly: a core transfer learning attack scenario in which safety alignment degrades through the fine-tuning process.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, inference_time
Applications
llm safety, alignment training, jailbreak defense