openZIM code contribution workflow

openZIM products mostly fall into two categories: Scrapers producing ZIM files and internal tools to manage everything around them.

Because of this and the profile of our audience, we do not need to maintain different releases:

Internal tools remain unreleased, and are deployed continuously. Outside users are rare.
Scrapers are mostly used in the Zimfarm, using pinned releases. Non-Zimfarm users are tech-savvy due to the nature of the tools. Should a release be problematic for their use case, they are invited to upgrade.

Our workflow is thus mostly trunk-based. We keep a single stable branch named main.

Ticket lifecycle

Features and bug fixes should respond to Issues on GitHub. Open one first if there's none. This is where we discuss the how and why. Although openZIM developers can sometimes be reached on Kiwix's Slack or via Jitsi, GitHub issues should archive all decisions.

Features and fixes are developed on separate branches, following the Git Feature Branch Workflow. There is no strong naming convention for those branches, but try to use short names matching the following PCRE regex: ^[0-9a-z-_]+$.
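As a quick illustration, a branch name can be checked against that pattern before pushing. This is a minimal Python sketch; the helper name is_valid_branch_name is hypothetical and not part of any openZIM tooling.

```python
import re

# pattern quoted in the guideline above
BRANCH_NAME_RE = re.compile(r"^[0-9a-z-_]+$")


def is_valid_branch_name(name: str) -> bool:
    """Whether name only uses lowercase letters, digits, hyphens and underscores."""
    return BRANCH_NAME_RE.fullmatch(name) is not None
```

With this helper, a name such as fix-issue_123 is accepted, while Feature/Foo is rejected because of the uppercase letters and the slash.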

Pushing directly to the main branch is usually not permitted.

Fixing an issue requires following a few rules:

Check there is a ticket (bug or feature request) explaining what is needed.
Check the ticket is not assigned to someone else.
Propose your solution first in a comment before starting to implement it. This should be short and high level (preferably non-technical). If more information is needed at this stage, ask for it in the ticket.
Once there is an agreement on your proposal, assign yourself the ticket (if not already done).
If you have write access to the repository, push your code to a branch there instead of a fork. Reviewers frequently push to PRs to speed up review.
Commits should be atomic and have a clear, short summary (less than 50 characters). Add a longer commit description if necessary.

Code is merged in via reviewed Pull Requests:

Create a draft PR first, without requesting a review
- Use a concise, descriptive name for your PR.
- If the PR is not straightforward or if your proposal is significantly different from what was proposed in the Issue, use the PR description to summarize your work.
- If another piece of software does not work as expected, an issue should be opened in its repository, and we should then discuss whether it makes sense to work on a workaround until the fix is implemented.
- Third-party exceptions should not be blindly silenced. If it's an expected one (Python relies on exceptions a lot), handle it properly. If it's not, it's an opportunity to log a trace and shut down properly, releasing resources.
- Make sure to link the fixed Issues to the PR so they get closed on PR merge. Write Fixes #1234 in the PR description.
- Include CHANGELOG entries in your PR (if the repository has a CHANGELOG).
- Check that all automated checks (tests, coverage, CodeFactor) pass.
  - If you need assistance in getting that cleared, mention the maintainer in a PR comment.
Once this is OK, undraft your PR and request a review.
- You probably know who's going to review your work (the repo maintainer(s)).
- If unsure, choose kelson42 who will reassign.
- Once review is requested, please do not push any code change to the branch until the maintainer has submitted their review.
  - If you need to, please send a PR comment first to halt the review and re-request/comment once you're done.
- No need to mention the reviewer separately, they will be notified by GitHub.
- PR review is a priority for openZIM maintainers.
- Should you not receive feedback within 3 days, mention @kelson42 on a PR comment.
- The reviewer will either approve the PR or request changes.
The reviewer will perform their duty.
- Conversations are usually started to discuss what needs to be changed.
- The PR author must provide feedback to all these conversations (code change and/or comments).
- Once feedback has been provided on all conversations, the PR author will re-request a review.
- It is important that the last person submitting feedback (code change and/or comments) does not mark the conversation as resolved, to ensure both parties are aligned before resolving a conversation (except if this is a clear and "easy" approval).
- This also means that it is the other party's responsibility to mark the conversation as resolved once everyone is aligned.
- For instance, if the PR reviewer asks for a code change, the PR author pushes the code change and then the PR reviewer resolves the conversation.
- Should something still need to be discussed in a conversation without blocking the merge, an issue must be opened to track the discussion point and the conversation will be resolved (the reviewer can explicitly ask the author to do it, or the author can suggest it in the conversation).
- Avoid rewriting commit history until the review is finished (doing this only at the end of the review helps a lot).
Should your branch be outdated due to other changes on base branch:
- Make sure to rebase onto the base branch so it can be merged in (usually the PR author's responsibility).
- Never merge main into your branch.
- Again, prefer to do it at the end of the review to avoid confusion between what has been reviewed and what is new.
Once all conversations are resolved (or if none are needed):
- You may be asked to rewrite your commits to clean up the commit history.
- The reviewer will approve the PR and merge the branch.
- Even if the PR author is also a maintainer and could merge as well, it is the reviewer's responsibility to merge, to ensure that no code is merged too soon.
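The draft PR checklist above asks that third-party exceptions are never blindly silenced. The following Python sketch shows one possible shape for that rule. The function name run_step, the logger name and the choice of FileNotFoundError as the "expected" exception are all assumptions made for illustration.

```python
import logging
import pathlib
import shutil

logger = logging.getLogger("scraper")  # hypothetical logger name


def run_step(step, work_dir: pathlib.Path) -> int:
    """Run step() and return an exit code, always cleaning up work_dir."""
    try:
        step()
    except FileNotFoundError as exc:
        # expected failure (assumed non-fatal here): handle it explicitly
        logger.warning("optional file missing, skipping: %s", exc)
        return 0
    except Exception:
        # unexpected failure: log the full traceback instead of silencing it
        logger.exception("unexpected error, shutting down")
        return 1
    finally:
        # release resources whatever the outcome
        shutil.rmtree(work_dir, ignore_errors=True)
    return 0
```

The expected case is handled with a narrow except clause, while anything else is logged with its traceback and turned into a non-zero exit code. The finally block runs in every case, so the temporary folder is removed either way.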

Regressions

Simply put, a regression is when "it was working and it's not anymore". Usually, such bad behavior is rooted in the code (a PR introducing a bug in a preexisting feature), but under certain circumstances it can be caused by changes in the environment (to which the code is no longer adapted). In both cases this is an impairment for the end user.

A regression does not have to be severe: the term itself says nothing about the level of impact of the bug. Besides the user impact, a regression means that the level of control over the code development process may be insufficient. Such an event is therefore a warning.

Each regression should lead to:

An issue clearly identified as such, via the regression tag.
A fix implemented as a priority (ideally by the developer who introduced the regression).
An explanation of why and how it was introduced.
Ideally, a proposal or implementation of a solution to prevent such regressions from happening again (usually via an improvement of the CI tests).

About your code

openZIM development is tailored for low maintenance and a low entry barrier, as our team is small and changes with the coming and going of volunteers.

We need our code to be easy to understand and work with. Keeping the same code stack across projects allows developers to work on all products and limits the skill set required to contribute. It also allows us to share bits of code, such as with zimscraperlib and kiwixstorage.

Container-first. We deploy everything through containers:
Set appropriate return code.
Print usable logs to STDOUT and STDERR.
Handle TERM/INT/HUP signals.
Write to user-customizable locations so users can mount those.
Development environments are Linux and macOS.
Emails are sent using Mailgun API.
Scrapers cache and large files are stored on S3 Buckets at Wasabi. See our S3 Cache Policy.
Don't include external dependencies in the repo. The repo should only contain our code. Most products include a setup script that downloads/installs such dependencies.
Databases to use MariaDB using databases+ormar[mysql] or SQLAlchemy; using migrations (alembic). We are moving away from MongoDB because of its high RAM usage, complexity to use in relatively-common aggregation scenarios, limited NoSQL need and lack of support for our backup-tool. Temporary (not backed-up) DB to use SQLite.
CI is done using GitHub Actions. Code coverage is watched by Codecov.
We use Continuous Deployment. docker-publish-action will build and push your image.
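The container-first rules above (return code, logs on STDOUT and STDERR, signal handling, user-customizable output location) can be sketched in Python as follows. This is only an illustration under stated assumptions: the tool name, the --output flag, its /output default and the exit code 1 on interruption are hypothetical choices, not an openZIM convention.

```python
import argparse
import logging
import pathlib
import signal
import sys
import threading

logger = logging.getLogger("mytool")  # hypothetical tool name
stop_requested = threading.Event()


def request_stop(signum, frame):
    """Signal handler: ask the main loop to stop at the next safe point."""
    stop_requested.set()


def run(output_dir: pathlib.Path, items) -> int:
    """Write one file per item under output_dir and return an exit code."""
    output_dir.mkdir(parents=True, exist_ok=True)
    for item in items:
        if stop_requested.is_set():
            logger.error("stop requested, exiting early")
            return 1
        (output_dir / f"{item}.txt").write_text(str(item))
        logger.info("wrote %s", item)
    return 0


def main(argv=None) -> int:
    parser = argparse.ArgumentParser()
    # user-customizable location, so it can be a mounted volume
    parser.add_argument("--output", type=pathlib.Path, default=pathlib.Path("/output"))
    args = parser.parse_args(argv)

    # all records go to STDOUT; errors are repeated on STDERR
    out = logging.StreamHandler(sys.stdout)
    err = logging.StreamHandler(sys.stderr)
    err.setLevel(logging.ERROR)
    logging.basicConfig(level=logging.INFO, handlers=[out, err])

    for sig in (signal.SIGTERM, signal.SIGINT, signal.SIGHUP):
        signal.signal(sig, request_stop)
    return run(args.output, range(3))
```

A script would typically end with sys.exit(main()) so that the container runtime sees the return code. The signal handler only sets a flag, which keeps it short and lets the loop stop between two items instead of in the middle of a write.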

Python

Backend and scrapers are written in Python.

Python 3.6+. With the introduction of 3.10 and the removal of 3.6 from GitHub Actions' macos-11 runner, we'll probably start requiring 3.8+ where we see fit.
Web APIs written using FastAPI. We are moving away from Flask to save manually maintaining OpenAPI spec and benefit from perf. improvements, websockets and simpler testing.
API access restrictions code should be permission-based, not role-based. Roles can wrap permissions though.
HTTP communications using requests. Keeping an eye on httpx.
Coding style:
- black, isort, Flake8 are mandatory.
- since we use black, you have to tell isort about it (with isort --profile black) and adjust flake8 maximum line length (with flake8 --max-line-length=88)
- type hints and docstring unless obvious.
- concise, useful docstrings (PEP257): Short summary of block on first line, details on subsequent ones if necessary. Don't duplicate signature (already done via type hints).
- comments on unexpected behaviors
- 4 spaces, no-tab indentation.
- if you use Visual Studio Code as an editor, you might use these settings (see Gist comments for details)
Database is PostgreSQL, with SQLAlchemy (Core + ORM) for data access and Alembic for schema migrations (for the curious, more on this choice in Python SQL toolkit).
Tests written with pytest. Coverage policy depends on projects. Scrapers are usually not unit-tested but can have full ZIM-creation runs in the CI. APIs and libs should be tested as much as possible.
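To illustrate the rule above that API access restrictions should be permission-based, with roles only wrapping permissions, here is a minimal framework-free sketch. The role names, permission names and helper functions are hypothetical, and dataclasses assumes Python 3.7 or later.

```python
from dataclasses import dataclass

# roles are only named bundles of permissions (hypothetical names)
ROLES = {
    "viewer": frozenset({"schedules.read"}),
    "editor": frozenset({"schedules.read", "schedules.update"}),
}


@dataclass(frozen=True)
class User:
    username: str
    permissions: frozenset


def user_with_role(username: str, role: str) -> User:
    """Build a user whose permissions come from a role."""
    return User(username=username, permissions=ROLES[role])


def check_permission(user: User, permission: str) -> None:
    """Raise PermissionError unless user holds permission."""
    if permission not in user.permissions:
        raise PermissionError(f"{user.username} lacks {permission}")
```

Endpoint code only ever calls check_permission with a permission name and never tests the role itself. Adding or reshaping roles then requires no change in the endpoints.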

Frontend

Frontend API consumers are built with Vue.js.

VueJS 3, Vue Router 4, Pinia 2 and axios for HTTP communications.
Bootstrap 5 for UI. FontAwesome (via vue-fontawesome) for icons.
Vuejs Coding Style, 2 spaces, no-tab indentation (idem for HTML and CSS).
Public websites should be hooked to stats.kiwix.org via vue-matomo.
Container image:
small, alpine-based nginx serving the static build of the project (using an intermediate builder)
constants such as API URL should be passed via environment variables: entrypoint to write it to a JSON file read by the App.
We are yet to do unit-tests for frontends. Want to suggest a tool set?

Note: our experience with Vue.js is limited. Suggestions welcome!

Tool stack is not a religion. We follow long-term trends to gain efficiency and match real-world available skills. Feel free to suggest different tools and workflows.

ZIM Basics

The ZIM file format (spec) is a custom binary format designed for storage of online website replicas meant to be accessed offline.

A ZIM file requires a ZIM reader to be used. Kiwix creates and distributes ZIM readers for various platforms. There are mobile ones (iOS and Android), desktop ones (Windows/Linux and macOS) as well as a JavaScript one.

The most popular one though is kiwix-serve, which is an HTTP server serving ZIM files. It is packaged in most of those readers and also available as a standalone CLI version inside kiwix-tools.

Although it's technically different, you can imagine a ZIM file like a ZIP file, with entries representing files. A ZIM file is composed of many Entries, referred to by a Path. A path can be anything like hello or a/b/c/report.pdf (notice there is no / prefix). When a ZIM file is created, it sets which Entry is the MainEntry, which the ZIM reader is expected to open first.

Entries can be of any type: PDF, video, CSS, ZIP or anything. Being designed for offline website use, it is common to have at least one HTML entry to serve as main entry. ZIM files of Wikipedia are composed of a bunch of CSS/JS entries, then one HTML entry for each article and one entry for each image in the encyclopedia.

ZIM files can also have Redirect entries that just point to other entries.
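The entry model described so far (paths without a / prefix, a main entry, redirects) can be pictured with a toy in-memory model. This sketch is purely conceptual: ToyZim and Entry are hypothetical names, and nothing here reflects the binary format or the libzim API.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Entry:
    path: str
    mimetype: str = ""
    content: bytes = b""
    redirect_to: Optional[str] = None  # set for Redirect entries only


class ToyZim:
    """In-memory stand-in for a ZIM file; NOT the libzim API."""

    def __init__(self, main_path: str) -> None:
        self.entries: Dict[str, Entry] = {}
        self.main_path = main_path

    def add(self, entry: Entry) -> None:
        if entry.path.startswith("/"):
            raise ValueError("paths have no / prefix")
        self.entries[entry.path] = entry

    def resolve(self, path: str) -> Entry:
        """Return the content entry at path, following redirects."""
        entry = self.entries[path]
        seen = {path}
        while entry.redirect_to is not None:
            if entry.redirect_to in seen:
                raise ValueError("redirect loop")
            seen.add(entry.redirect_to)
            entry = self.entries[entry.redirect_to]
        return entry

    @property
    def main_entry(self) -> Entry:
        """Entry a reader would open first."""
        return self.resolve(self.main_path)
```

In this model the main entry may itself be a redirect: a reader asking for main_entry ends up on the HTML entry the redirect points to.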

In addition to those Content Entries, ZIM files also contain two Xapian indexes: one containing all the public-facing (FRONT_ARTICLE) Entries and another one with a full-text index of the content used by the Search Engine. Scrapers have the ability to specify which Entries are FRONT_ARTICLE and what to feed the full-text index with.

The last important kind of Entries is Metadata. While they are as flexible as other entries, those are used mostly as a text key-value mapping to expose ZIM metadata. The spec defines which Metadata are required (Title, Language, etc) and which other ones are expected or in-use.

Tags Metadata for instance is relied-upon to classify ZIM files in readers: there is no other classification mechanism. See spec for directions on how to use them and a list of internal tags we use to expose per-ZIM features.

Another important Metadata is the Illustration (used to be called Favicon). A mandatory 48x48 PNG image is expected at Entry Illustration_48x48. It's the icon that readers display.
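A scraper may want to verify that its illustration really is a 48x48 PNG before adding it. The sketch below reads the dimensions from the PNG header using only the standard library. The function names are hypothetical, and the check only inspects the header; it does not prove the image decodes correctly.

```python
import struct
from typing import Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data: bytes) -> Tuple[int, int]:
    """Return (width, height) read from the IHDR chunk of PNG data."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG image")
    # width and height are two big-endian 32-bit integers right after "IHDR"
    return struct.unpack(">II", data[16:24])


def is_valid_illustration(data: bytes, size: int = 48) -> bool:
    """Whether data is a PNG of exactly size x size pixels."""
    try:
        return png_size(data) == (size, size)
    except ValueError:
        return False
```

Calling is_valid_illustration on the bytes of the icon file returns False for a non-PNG file or for a PNG with other dimensions.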

Toolbox

library.kiwix.org is a kiwix-serve offering most of the ZIM files we produce. You can access those online directly.
download.kiwix.org/zim is where you can download those ZIM files.
download.kiwix.org/release is where you can download readers and other tools.
kiwix-serve (in kiwix-tools) is a must have to test and inspect ZIM files.
zimdump (in zim-tools) allows you to dump the content of a ZIM file into a folder.
libzim is the easiest way to inspect ZIM files, if you're comfortable with Python.

kiwix-serve provides access to Metadata as well.

kiwix-tools and zim-tools are only released for Linux at the moment. Either compile them (see kiwix-build) or use Docker images from openzim and kiwix.

Read metadata of a ZIM file

Using kiwix-serve's /meta endpoint. ⚠️ subject to change: see libkiwix#631.

curl "http://library.kiwix.org/meta?content=wikipedia_pag_all_maxi_2021-10&name=title"
curl "https://dev.library.kiwix.org/meta?name=Illustration_48x48@1&content=devdocs.io_en_all_2021-05"

Using Python libzim, via zimscraperlib

pip install zimscraperlib

from zimscraperlib.zim import Archive
zim = Archive("path/to/file.zim")
print(zim.metadata)

Extract ZIM content to file system

zimdump dump --dir /data/dump /data/khanacademy_fr_kolibri_2021-11.zim
