Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue278 dockerize annif #279

Merged
merged 66 commits into from
Jun 26, 2019
Merged

Issue278 dockerize annif #279

merged 66 commits into from
Jun 26, 2019

Conversation

juhoinkinen
Copy link
Member

@juhoinkinen juhoinkinen commented May 21, 2019

Different Dockerfiles for creating images.
Based on python:3.6-slim (138 MB):

  • Annif with all backends (Dockerfile, 1.19GB)
  • Annif with only TFIDF backend (Dockerfile-slim, 653MB)
  • optional dependencies/backends for Annif (Dockerfile-optional-dependencies, 1.19GB)

Based on python:3.6-alpine (79.1MB):

  • Annif with only TFIDF backend (Dockerfile-alpine, already 1.24GB, and building takes ~15 min)

Alpine approach seems quite tricky regarding all backends and size increase (and the alpine base image is only 60 MB smaller than python-slim). Maybe best is to proceed on python:3.6-slim based images.

Tests pass in all containers.

There was a problem using Vowpal Wabbit (related to libboost-python-dev, which has dependency on Python2 and also wrongly linked to it). Using VW version 8.4 solves this.

Few TODOs in Dockerfiles for shrinking image size.

Usage of containers is e.g. like this (or, actually, maybe the container doesn't need to be running all the time...):

  1. Start bash in a container running in background (bind-mounting vocabylary and training data):
    docker run -v /home/local/jmminkin/data:/data -itd --name annif_container annif bash

  2. Execute commands in the already running container:
    docker exec annif_container annif
    docker exec annif_container annif loadvoc tfidf-en /data/vocab/yso-en.tsv
    docker exec annif_container annif train tfidf-en /data/training/yso-finna-en.tsv.gz
    echo This is the sample text to run suggest test on. | docker exec -i annif_container annif suggest tfidf-en

@osma osma added this to the v0.41 milestone May 22, 2019
@codecov
Copy link

codecov bot commented May 22, 2019

Codecov Report

Merging #279 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #279   +/-   ##
=======================================
  Coverage   99.31%   99.31%           
=======================================
  Files          52       52           
  Lines        2624     2624           
=======================================
  Hits         2606     2606           
  Misses         18       18

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ffc947f...61f7345. Read the comment docs.

@osma
Copy link
Member

osma commented May 22, 2019

Excellent!

In the end we probably want a single Docker image which contains all optional dependencies. The TFIDF-only variant is only useful as an intermediate step, it's unlikely to be useful on its own.

Basing on python3.6-slim seems like the best choice. With Alpine it's going to be difficult to install all the dependencies, and the image size is already huge without them for some reason.

I think we also want to include a WSGI server in the image, for example uwsgi or gunicorn, so that we can run a production web server. Flask includes a simple HTTP server (annif run starts it) but it's intended for development only. Here's a tutorial for uwsgi. We probably need nginx too though I'm not 100% sure since I've only used Annif via WSGI with Apache mod_wsgi.

We also need documentation in the wiki for how to run Annif via Docker, something similar to this page on the Skosmos wiki.

@kinow any comments?

(closes #278)

@kinow
Copy link
Collaborator

kinow commented May 22, 2019

@osma I saw this commit this morning and was really happy as this would make it easier for some devs - myself included - to run and contribute to Annif.

I noticed the several images, which is not a problem, if the intention is to allow users to choose the flavour, and use the image not only for development, but also in production - the Skosmos one was created with development in mind, but with some polishing should be able to run on Amazon, Docker Swarm, etc.

Only comment I have is that it would be good if the documentation included how to map the volumes needed for Annif. It would be really great if I could have all the corpora and configuration files needed in some volume, map to one or multiple containers, and be able to switch branches while developing, re-using the Docker files without destroying the data. 🎉

@osma
Copy link
Member

osma commented May 22, 2019

Thanks for chiming in @kinow!

I think we want to end up with one or maybe two Dockerfiles, not more, but it's good to test different variations first.

I agree that at least for development, the configuration and data files should live outside the Docker image, or at least that should be the default option. For some deployment scenarios it might actually be desirable to include the config and data (pretrained models) within the container, but I suppose we could support both.

@osma
Copy link
Member

osma commented Jun 19, 2019

Thank you @natlibfi-arlehiko for your comments. Some responses:

Use only one RUN instruction in Dockerfiles to prevent creation of multiple layers. Or better yet, use multi-stage builds and copy the relevant files from earlier build stages

The current Dockerfile uses a multi-stage build already.

Run Swagger UI in a separate container

Swagger UI is built in to Connexion, which is a core component of Annif. I don't see how to easily move it to a separate container.

Do the actual training separately and make the actual Annif container use the resulting data

This is the current plan. The main container only includes the software itself, not the data which is mounted separately.

Some of the Python dependencies (Which I image are required for the training) seem to be pretty large:

Unfortunately it seems difficult to get rid of any of these, @juhoinkinen already tried excising the most useless ones, without success.

@juhoinkinen Do you think this is starting to be ready for merging? The PR is still marked as draft but I don't see any major actionable TODOs in the recent discussion.

@natlibfi-arlehiko
Copy link

Swagger UI is built in to Connexion, which is a core component of Annif. I don't see how to easily move it to a separate container.

The way we do it in Melinda's record import API:

  • The REST API serves OpenAPI JSON with GET /
  • Run a separate Swagger UI container that fetches the OpenAPI spec from the previously mentioned REST API

Might be possible with connexion too.

@osma
Copy link
Member

osma commented Jun 19, 2019

@natlibfi-arlehiko I see. That's good from the perspective of separating concerns between containers, but I don't think it will reduce the container size, since Connexion will include Swagger UI regardless of whether we enable it or not. Did I understand correctly?

@natlibfi-arlehiko
Copy link

but I don't think it will reduce the container size, since Connexion will include Swagger UI regardless of whether we enable it or not. Did I understand correctly?

Yeah. Good point :D

@juhoinkinen
Copy link
Member Author

@juhoinkinen Do you think this is starting to be ready for merging? The PR is still marked as draft but I don't see any major actionable TODOs in the recent discussion.

Yes, I just updated the wiki https://github.com/NatLibFi/Annif/wiki/Usage-with-Docker about the docker-compose to combine Annif+NGINX+Mauiservice (before there were two docker-compose files, one for Annif+NGINX and other for Annif+Mauiservice). The current docker-compose.yml is now like a template, from where unnecessary services can be removed if wanted.

There is also docker-compose-portainer.yml, but that is used to deploy the services in our own platform/instance, and there is still work to be done, but that maybe is not part of this pull request.

@juhoinkinen juhoinkinen marked this pull request as ready for review June 20, 2019 13:41
@osma
Copy link
Member

osma commented Jun 24, 2019

Looks good now. My only concern is that the Drone build is failing according to the above status display. I can see there is this error at the end of the Drone docker log:

unauthorized: access to the requested resource is not authorized
time="2019-06-20T13:43:08Z" level=fatal msg="exit status 1" 

Do you need to add some kind of credentials somewhere?

@natlibfi-arlehiko
Copy link

Looks good now. My only concern is that the Drone build is failing according to the above status display. I can see there is this error at the end of the Drone docker log:

unauthorized: access to the requested resource is not authorized
time="2019-06-20T13:43:08Z" level=fatal msg="exit status 1" 

Do you need to add some kind of credentials somewhere?

Fixed.

@osma osma merged commit a2d3be4 into master Jun 26, 2019
@juhoinkinen juhoinkinen deleted the issue278-dockerize-annif branch August 15, 2019 13:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants