Use a multi-stage build process. #475

Merged
merged 27 commits into from
Sep 24, 2021

gouttegd
Contributor

This PR completely overhauls the way both odkfull and odklite are built, in order to:

  • reduce the size of each image (Shrink ODK image #217);
  • on arm64, allow installing from source some programs that are not available as pre-compiled binaries (e.g. Soufflé, Include soufflé? #459), without cluttering the final images with build-time dependencies.

Along the way, it also includes the proposed freeze of Python packages (#463), fixes the broken Konclude (#473), and adds some basic checks to ensure the various programs of the ODK are installed correctly (#472).

Basically, building odkfull is now a three-step process:

(1) First, an intermediate image called odkbuild is built. This image serves as a sort of "staging area", where we can install all the build-time dependencies we want (without worrying about image size) and build everything we need. This includes installing the Python modules, because some of them have to be compiled on arm64 for lack of pre-compiled arm64 wheels.

(2) Second, odklite is built, using some of the tricks proposed in #471 to keep the image size minimal (mostly forbidding apt-get from installing "recommended" packages and using a headless variant of the JRE). That image contains ROBOT, DOSDPTOOLS, OWLTOOLS, and Ammonite, along with the minimal set of Python modules required to run odk.py. It is intended that this image should be sufficient for use with a "standard" ODK-generated repository. Size of the image: ~650 MB.

(3) Third, odkfull is built on top of odklite (which is now, by construction, a strict subset of odkfull), adding Konclude, Soufflé, Jena, SPARQL-prog, along with a richer set of Python modules and build tools. Size of the image: ~1.8 GB.
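
As an illustration of the ordering described in steps (1) to (3), here is a minimal sketch of a script that builds the three images in sequence. The `docker/<image>` directory layout, the `build_chain` function name, and the overridable `DOCKER` variable are assumptions made for this example; the PR itself drives the builds from Makefiles, which are not shown here.

```shell
#!/bin/sh
# Hypothetical sketch: build the images in dependency order.
# The docker/<image> context directories are assumed, not taken from the PR.
DOCKER=${DOCKER:-docker}

build_chain() {
    # odkbuild provides the build artifacts, odklite is the minimal image,
    # and odkfull is built on top of odklite, so the order matters.
    for image in odkbuild odklite odkfull; do
        $DOCKER build --tag "obolibrary/$image" "docker/$image" || return 1
    done
}
```

Setting `DOCKER=echo` before calling `build_chain` prints the commands instead of running them, which is a cheap way to check the order.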

Make the odklite image the minimal image needed to use a "standard"
ODK-generated repository.
Update the run.sh wrapper script to allow choosing the ODK image to use
without having to edit the file. Use the odklite image by default.
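
A wrapper script can let the caller pick the image through an environment variable with a default, roughly as sketched below. The variable name `ODK_IMAGE` and the function `select_image` are hypothetical; the actual run.sh may use different names.

```shell
#!/bin/sh
# Hypothetical sketch: choose the ODK image without editing the script.
# ODK_IMAGE is an assumed variable name; odklite is the default.
select_image() {
    echo "obolibrary/${ODK_IMAGE:-odklite}"
}
```

The wrapper would then pass the result of `select_image` to `docker run`.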
Use constraints to fix the versions of Python packages (constraint file
generated by 'pip freeze').
Avoid messing with the default /usr/bin/python executable. Just make
sure to always call python3.
Update the version number in docker/odklite/Makefile (used when not
called from the top-level Makefile).

Define ODK_VERSION in the environment from odklite.
Build the odkfull image on top of the odklite image. Use a multi-stage
build to install SWI-Prolog and Soufflé.

Do not build the obolibrary/robot image, as it is redundant with the
obolibrary/odklite image (which is now barely larger).
There's no pre-compiled binary for fastobo-validator targeting the arm64
architecture, so we build it ourselves on the builder image.
Konclude is broken on all platforms: on x86_64, the version we used to
install depends on the musl C library that is not provided in the image;
on arm64, we don't have a compatible binary.

On x86_64, we now install a more recent version of Konclude that depends
on the standard GNU libc. On arm64, we compile Konclude ourselves.
Docker should default to building an image targeting the architecture it
is currently running on, but for some reason on M1 that does not always
seem to work, and it may sometimes try to build an amd64 image instead
(thus causing all sorts of failures down the line). Better to be explicit
and say which architecture we are targeting.

The default is taken from the operating system's uname command, so it *should*
always give the expected result without any user intervention. If
needed, the user can override the automatic detection of architecture by
assigning the ARCH variable herself when calling make.
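
The detection logic described in this commit message could look like the following sketch, written here as a shell script rather than as Makefile code. The function names `docker_arch` and `target_arch` and the exact set of machine names handled are assumptions for illustration.

```shell
#!/bin/sh
# Map a machine name, as reported by `uname -m`, to a Docker architecture.
docker_arch() {
    case "$1" in
        x86_64|amd64)  echo amd64 ;;
        aarch64|arm64) echo arm64 ;;
        *) echo "unsupported architecture: $1" >&2; return 1 ;;
    esac
}

# An ARCH value set by the caller wins over automatic detection.
target_arch() {
    if [ -n "${ARCH:-}" ]; then
        echo "$ARCH"
    else
        docker_arch "$(uname -m)"
    fi
}
```

The result can then be given to Docker explicitly, for example with `docker build --platform "linux/$(target_arch)"`.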
Remove useless '>= 0.0' version qualifiers and remove packages that are
not strictly needed in a standard ODK-generated repository.
Move all the build steps to a dedicated "builder" image called
"obolibrary/odkbuild".
Create separate rules to build the builder image and the odklite images,
and adjust the dependency chain so that odkfull depends on odklite which
depends on odkbuild.
Even though the builder image is not intended for public use, we may
have to push it to the Docker Hub for building multi-arch images, so we
might as well try to reduce its size if we can.
Make sure to build the builder image (as a multi-arch image itself)
before attempting to build the final multi-arch images.

Because there does not seem to be a way to load a multi-arch image
locally, the builder image will have to be pushed to the Docker Hub
before it can be used to build the final odklite and odkfull images.
This is silly and inefficient, but Docker does not offer any other
solution.
Even though odklite is now intended to be sufficient to run a standard
ODK-generated repository, it is probably too early to push such a change.
We should collect some feedback from ODK users first.
When installing the Python packages for the odkfull image, pip is
ignorant of the packages we have already installed for the odklite
image, so some packages already installed in the odklite staging tree
may be installed again in the odkfull staging tree, if they are
dependencies of the odkfull packages. This would result in artificially
and needlessly increasing the size of the odkfull image, because those
Python packages will be present in two different layers.

To avoid that, we forcibly remove from the odkfull staging tree any file
that is already present in the odklite staging tree.
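
A minimal sketch of that clean-up step, assuming both staging trees are plain directories and that "already present" means "exists at the same relative path" (contents are not compared). The function name `prune_duplicates` is hypothetical, and file names containing newlines are not handled.

```shell
#!/bin/sh
# prune_duplicates LITE_TREE FULL_TREE
# Remove from FULL_TREE every non-directory entry that also exists,
# at the same relative path, in LITE_TREE.
prune_duplicates() {
    lite_tree=$1
    full_tree=$2
    (cd "$lite_tree" && find . ! -type d) | while IFS= read -r path; do
        # Never touch directories; they may still hold odkfull-only files.
        [ -d "$full_tree/$path" ] || rm -f "$full_tree/$path"
    done
}
```

Directories left empty by this step still need to be removed separately.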
The standard ODK Makefile will soon use curl in some rules, so it needs
to be part of the odklite image.
Check that the various programs installed as part of the ODK can be run
from the image. This should catch the most obvious installation
problems, such as a binary with the wrong architecture (as happened with
Konclude recently) or a missing shared library.
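
Such a check can be as simple as running each program once and looking at its exit status. The sketch below is an assumption about how this could be done, not the test suite added by this PR; `check_program` is a hypothetical helper.

```shell
#!/bin/sh
# check_program COMMAND [ARGS...]
# Run the command, discard its output, and report whether it could be run.
check_program() {
    if "$@" >/dev/null 2>&1; then
        echo "OK: $1"
    else
        # A missing program, a wrong-architecture binary, or a missing
        # shared library all end up here with a non-zero exit status.
        echo "FAILED: $1"
        return 1
    fi
}
```

A caller would invoke it once per tool with an argument that makes the tool exit quickly, for example `check_program robot --version`, assuming the tool accepts such a flag.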
When running the test suite, if the image to be tested is not locally
available, print a message to say so instead of silently terminating.
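
One way to detect a missing image before running the tests is `docker image inspect`, which fails when the image is not available locally. The following is a sketch under that assumption; `require_local_image` and the overridable `DOCKER` variable are hypothetical names, and the wording of the message is invented for the example.

```shell
#!/bin/sh
DOCKER=${DOCKER:-docker}

# require_local_image IMAGE
# Print an explicit message instead of failing silently.
require_local_image() {
    if ! $DOCKER image inspect "$1" >/dev/null 2>&1; then
        echo "Image $1 is not available locally; build or pull it first." >&2
        return 1
    fi
}
```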
Update the README files in the docker directory to reflect the recent
changes in the different images.
@matentzn
Contributor

This is so awesome, I can't contain myself. Running builds locally now!

Do not mandate package versions in requirements.txt. Use the
constraints.txt file to freeze version numbers.
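
With this split, requirements.txt only names the packages and constraints.txt pins their versions. A minimal sketch of the corresponding pip call follows; the function name `install_frozen` and the overridable `PIP` variable are assumptions for illustration.

```shell
#!/bin/sh
# PIP can be overridden, for example to print the command instead of running it.
PIP=${PIP:-python3 -m pip}

# install_frozen REQUIREMENTS_FILE CONSTRAINTS_FILE
# The requirements file says what to install; the constraints file
# (for example the output of `pip freeze`) says which versions to use.
install_frozen() {
    $PIP install --requirement "$1" --constraint "$2"
}
```

A constraints file only restricts versions; a package listed there is not installed unless something in the requirements file needs it.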
When cleaning up redundant packages in the odkfull staging tree, make
sure to remove empty directories (this was not the case because rmdir
was called with an invalid -f option).
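
Since rmdir has no -f option, one way to drop the directories left empty is to let find select them, as in the sketch below. It assumes a find that supports `-mindepth` and `-empty` (GNU and BSD find do); `remove_empty_dirs` is a hypothetical name.

```shell
#!/bin/sh
# remove_empty_dirs TREE
# -depth visits children before their parent, so a directory whose only
# content was empty subdirectories is itself empty by the time it is tested.
# -mindepth 1 keeps the root of the tree even if it ends up empty.
remove_empty_dirs() {
    find "$1" -mindepth 1 -depth -type d -empty -exec rmdir {} \;
}
```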
Since the ODK build process is now a bit more complex, add some docs to
explain how and where to add a new component when needed.
@matentzn
Contributor

OK, I did a few checks locally. I am wondering if more needs to be done. If you have no reasonable concerns, I am happy.

I:

  1. rebuilt multiple times
  2. ran ODK on cell (ok)
  3. ran ODK on dashboard (ok)

Happy to merge if you are.

@gouttegd
Contributor Author

The one thing that still needs to be tested is building multi-arch images. It should work as expected (i.e., the build-multiarch.yml GitHub Action should still do the job automatically upon the next release), but there's still a possibility that the GitHub builders would somehow be unable to deal with multi-stage images. I didn't find any report suggesting that would be the case, but who knows…

If you agree, what I would like to do is to temporarily bump the version number to "1.2.30-alpha" and manually trigger the build-multiarch action on this PR, so that we can check that the multi-arch build process still works. (The rationale for adding an "-alpha" suffix to the version number is that the generated images will end up on Docker Hub – that's the whole point of the build-multiarch action after all –, so we need to make sure users know the new images are not an actual release.)

@matentzn
Contributor

I am not worried about the -alpha; if the build succeeds and the image is pushed, then a docker pull will obtain that image, whether the tag says alpha or not, right?

@gouttegd
Contributor Author

I was hoping there would be a way to mark an image on Docker Hub as "beta", "not-production-ready", or anything like that, but that does not seem to be the case. Damn.

The next best option is then to publish under the obotools name (where we initially published the M1-specific images when we were still figuring out how to build multi-arch images) instead of the obolibrary name. I'll do that from my own repo, which is already set up for publishing in obotools.

@matentzn
Contributor

Ok, if this works as well, great. I would like to aim for an ODK update soon anyway, but want to deal with some of the smaller tickets first. Maybe the first week of October.

@gouttegd
Contributor Author

gouttegd commented Sep 22, 2021

GitHub multi-arch builders are so slow … 5 hours in and it's still building the odkbuild image, which is the first step of the build process…

Use the same name as in the other Makefiles and make sure to push the
images to the hub.
@gouttegd
Contributor Author

… And the build was forcibly cancelled because it was taking too long… There's a 360-minute limit for each job, and building the odkbuild image takes longer than that on GitHub builders (mostly because of the need to build Konclude from source on arm64).

There are a few things we can try to circumvent this, such as creating another intermediate image just to build Konclude (so that it is built in a different job, which would be subject to its own time limit), but I am very reluctant to make the build process even more complex than it already is just to accommodate GitHub's arbitrary limits… Maybe we should just go back to building the images-to-be-released on one of our own machines and forget the build-multiarch GitHub Action entirely.

@matentzn
Contributor

It's a pity but I don't mind. I have servers etc that do this work for me! Whatever you think is best. I have no problem building locally and pushing!

@gouttegd
Contributor Author

I'd say merge as it is then. At the next release, we can try the automated build again, and if it fails again (which is likely, unless in the meantime GitHub switches to faster machines for its builders), you can build and push the images yourself. Let me know if you need any help setting up your local builder for multi-arch builds, if you have not done so already.

When building the builder image, remove the source code and object files
of the projects we are compiling (SWI-Prolog, Soufflé, Fastobo,
Konclude), and only keep what has been installed in the staging area.
This saves more than 1 GB in the odkbuild image.
@gouttegd
Contributor Author

I've just pushed one last bit to remove intermediate object files from the odkbuild image. This is not strictly necessary but it reduces the size of odkbuild by up to 1 GB.

I am done tweaking things with that PR, I promise! :)

@matentzn
Contributor

Thank you, running tests now :) Is it ready to merge then? Are you happy? If my tests pass here, I will just merge it then.

@gouttegd
Contributor Author

Still disappointed by how slow the GitHub builders are, and the subsequent inability to use the build-multiarch action, but yes, I am happy with that PR. :)

@matentzn
Contributor

First failure on my third server:

Get:95 http://archive.ubuntu.com/ubuntu focal/universe amd64 dos2unix amd64 7.4.0-2 [374 kB]
E: Failed to fetch http://security.ubuntu.com/ubuntu/pool/main/l/linux/linux-libc-dev_5.4.0-84.94_amd64.deb  404  Not Found [IP: 91.189.88.152 80]
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
Fetched 69.8 MB in 7s (9826 kB/s)
The command '/bin/sh -c DEBIAN_FRONTEND="noninteractive" apt-get install -y --no-install-recommends     build-essential     openjdk-8-jdk-headless     maven     python3-dev     subversion     automake     aha     dos2unix     sqlite3     libjson-perl     pkg-config     xlsx2csv' returned a non-zero code: 100
Makefile:69: recipe for target 'build' failed
make: *** [build] Error 100

Let's see whether it was a fluke.

@matentzn
Contributor

Unfortunately it was not: building the odklite image fails on my Ubuntu machine.

NAME="Ubuntu"
VERSION="18.04.5 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.5 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"

I am not too worried, but until now I was able to run the builds there. What is your opinion?

@gouttegd
Contributor Author

I can't reproduce the error here, whether on my macOS machine or my Linux machine… What's the Docker version on your Ubuntu 18.04 machine?

@matentzn
Contributor

I completely uninstalled and reinstalled docker:

Docker version 20.10.8, build 3967b7d

@gouttegd
Contributor Author

So that's the same version as the one on my macOS machine, and an even fresher version than the one on my Linux machine… Sorry, I have no idea what could cause the problem here. It looks as if apt-get was working from an obsolete list of packages, but the cache is updated at the beginning of building the odklite image, so that shouldn't happen…

@matentzn
Contributor

Maybe it's OS-version related in some weird way we don't understand. Let's merge it; it works on two of my machines, so it's fine.
