How the build system works

Nathan Carter edited this page Sep 28, 2023 · 1 revision

If you have cloned the How to Data GitHub repository to your computer, you can build your own copy of the website. Doing so is useful if you have made changes to site content and want to test those changes before submitting them for inclusion in the site. The easiest way to do so is using the Control Panel, as documented in this wiki. You can also build the site by running the build.sh script from the command line, which is described below.

This page of the wiki documents how that build process works. It focuses on the build script rather than the Control Panel app, because the script is the simpler of the two.

Each file involved in the build process is described below, ordered in a way that makes it easiest to read from top to bottom.

build.sh

This file is a simple shell script that does the following.

  • ensures that you are in the correct directory
  • runs build.py, which is the main script of the build process (described below)
  • copies the CNAME file into the output folder to ensure GitHub can use the how-to-data.org domain name

build.py

This script loads the how-to-data module (documented below) and uses it to run the two main steps of the build: database_to_jekyll() followed by jekyll_to_site().

how-to-data.py

This module does all the work of the build process, including the following steps.

  • loading all the other modules in the codebase, which includes modules for tasks, topics, solutions, configuration, and more
  • defining several global functions that combine the functionality of those modules (e.g., clearing all their caches, etc.)
  • defining the two main functions of the build process:
    • database_to_jekyll() converts the database/ folder to the jekyll-input/ folder, as mentioned above, and documented further below
    • jekyll_to_site() runs the Jekyll build script on the contents of the jekyll-input/ folder to build the site, as mentioned above
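
As a rough illustration of the second function, here is a minimal sketch of how a Python script might run Jekyll on the jekyll-input/ folder. The helper names jekyll_command and run_jekyll and the site-output folder name are hypothetical and are not taken from how-to-data.py. Only the jekyll build command and its --source and --destination options are standard Jekyll.

```python
import subprocess


def jekyll_command(source="jekyll-input", destination="site-output"):
    """Build the argument list for a Jekyll build (hypothetical helper)."""
    # "site-output" is an assumed folder name, used only for illustration.
    return ["jekyll", "build", "--source", source, "--destination", destination]


def run_jekyll(source="jekyll-input", destination="site-output"):
    """Run Jekyll; raises CalledProcessError if the build fails."""
    subprocess.run(jekyll_command(source, destination), check=True)
```

Keeping the command construction separate from the call that runs it makes the command easy to inspect or log without having Jekyll installed.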

Converting the database to Jekyll input

The central function database_to_jekyll() (mentioned above as part of how-to-data.py) takes the following steps.

  • Delete any files generated in a previous build that are no longer needed.
  • Copy any static files (e.g., images) from the database to the jekyll-input/ folder.
  • Copy any static pages as well, but replace placeholders in them with relevant content (e.g., replace the placeholder OVERALL_STATS with a table of the overall statistics of the website, such as the number of contributors).
  • For any new or updated solution file, convert it to Markdown format, running any code within it and retaining that code's output as part of the process. This is what most people would consider the main (and longest and most significant) step of the build process.
  • For any new or updated task or topic file, convert it to Markdown format, with placeholder replacements as above.
  • Create a page for each software package in the database as well.
  • Delete any Markdown files in jekyll-input/ that were not added as part of the above steps (and were thus left over from a previous build).
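
Two of the steps above, placeholder replacement and the final cleanup of leftover Markdown files, can be sketched as follows. This is only an illustration under stated assumptions: the function names fill_placeholders and remove_stale_markdown are hypothetical, and the real code in how-to-data.py may track generated files quite differently.

```python
from pathlib import Path


def fill_placeholders(text, replacements):
    """Replace each placeholder name (e.g. OVERALL_STATS) with its content."""
    for name, content in replacements.items():
        text = text.replace(name, content)
    return text


def remove_stale_markdown(output_dir, written):
    """Delete .md files under output_dir that this build did not write."""
    keep = {Path(p).resolve() for p in written}
    removed = []
    # sorted() collects the full file list before anything is deleted.
    for md_file in sorted(Path(output_dir).rglob("*.md")):
        if md_file.resolve() not in keep:
            md_file.unlink()
            removed.append(md_file)
    return removed
```

In this sketch the build would record every Markdown file it writes, then pass that list as written so that only files left over from earlier builds are removed. Non-Markdown files such as images are left alone.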

Modules

The steps taken by the database_to_jekyll() function documented above use the following modules, each of which has some documentation in its source code.

  • The Solutions module can load all solutions from disk, work with their original file formats, look up or compute properties of them, determine whether they need to be rebuilt, rebuild them if so, extract the main body of the result for use in other pages, and more.
  • The Tasks module can load all tasks from disk, work with their original file formats, look up or compute properties of them, rebuild them, extract sections for particular software packages from the built result, and more.
  • The Topics module can load all topics from disk, work with their original file formats, look up or compute properties of them, find all tasks mentioned in them, rebuild them as web pages for the site or PDFs to download for offline use, and more.
  • The Software module can load all software package names from a configuration file, know which of them have Jupyter kernels (and what those kernels are named), work with sets of library names that appear alongside software names in solution file titles, look up or compute properties of each software package, and more.
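
This page does not say how the Solutions module decides whether a solution needs to be rebuilt. One common approach, assumed here purely for illustration, is to compare file modification times. The function name needs_rebuild is hypothetical.

```python
import os


def needs_rebuild(source_path, output_path):
    """Return True if the output is missing or older than its source."""
    if not os.path.exists(output_path):
        return True
    # Assumption: a newer source file means the built output is out of date.
    return os.path.getmtime(source_path) > os.path.getmtime(output_path)
```

A check like this is what lets a build skip unchanged solutions, which matters because rebuilding a solution means rerunning the code inside it.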

There are other Python modules stored in the root folder of this repository, but they are simpler tools (e.g., for logging or reading files) and thus are not as closely related to the purpose of this project. We do not document them here.