Investigate conda-store to understand if / how it could be used in JupyterHub #786

Closed
2 tasks done
yuvipanda opened this issue Oct 28, 2021 · 12 comments

@yuvipanda
Member

yuvipanda commented Oct 28, 2021

Description

For https://github.com/2i2c-org/meta/issues/252, 2i2c-org/features#3, 2i2c-org/features#6, and pangeo-data/jupyter-earth#79, we want to spend some time investigating conda-store. We should have a deeper understanding of what it is and how it works, so we can figure out where (and if) it can be helpfully deployed for us. It's also important to see if we can contribute back upstream productively, as community ownership & governance is very important for long term sustainability of projects.

Yuvi will try to write a TLJH plugin to get conda-store to work with TLJH. The goal of this exercise is to learn more about conda-store and how it might be used in a JupyterHub. This helps him understand what it takes to deploy the simplest possible production JupyterHub with a working conda-store implementation, and might independently be useful for TLJH users as well.

Value / benefit

Getting conda-store to work with tljh helps me fill out https://hackmd.io/_hQdrilJQFCGExUqYgwY3Q?edit with a quick evaluation of conda-store, to help understand how it can fit our use cases. It also helps figure out how upstream collaboration would work, as we'll need to send patches upstream for sure - conda-incubator/conda-store#196 is a start.

Ultimately, this will help us figure out how much time and effort we can spend on improving and using conda-store.

Tasks to complete

  • Try implementing a TLJH plugin for conda-store
  • Write up major things that were learned along the way

Updates

@yuvipanda yuvipanda self-assigned this Oct 28, 2021
@costrouc

@yuvipanda please let me know how I can help and feel free to open issues on conda-store regarding the integration. Also happy to meet over a call sometime if you would like.

@dharhas

dharhas commented Oct 28, 2021

community ownership & governance is very important for long term sustainability of projects

This is missing from conda-store right now but is the direction we want to go.

@costrouc

@yuvipanda I commented on the hackmd as well.

@costrouc

Additionally @yuvipanda, I've completed the systemd example of running conda-store, which might help you with the tljh plugin. See https://github.com/Quansight/conda-store/tree/main/examples/ubuntu2004. It has the systemd configuration files and shows a minimal setup.

@yuvipanda
Member Author

Thanks a lot for the merge and the new release, @costrouc! I'll continue to open issues and PRs as I go along :)

And thanks for responding on the hackmd too! I'd love for you to transition to FastAPI (or another async framework) - I think moving from sync to async, especially in a process involving db transactions, becomes difficult once the codebase reaches critical mass. And currently, almost all the server side software in this space (dask-gateway, jupyter, etc) is async, so it would be great to follow suit - helps with code re-use too wherever possible.

@yuvipanda
Member Author

So, the hub environment in TLJH uses a virtualenv, and conda isn't really installable in anything other than conda environments. The setup.py doesn't actually work in a fresh python environment due to conda/conda#10691. I fell into that rabbit hole a tiny bit (see conda/conda#11014) but have pulled back. I think conda-store can't run in the same environment as the hub as that is a virtualenv, but we can create a new conda env for conda-store to run out of instead.

@damianavila
Contributor

damianavila commented Nov 2, 2021

I think conda-store can't run in the same environment as the hub as that is a virtualenv, but we can create a new conda env for conda-store to run out of instead.

How about installing miniconda as part of the bootstrapping TLJH does? Then you can create the conda env and install conda-store in it (assuming conda-store does not need to run from the base environment that is already available after installing miniconda, which should not be an issue, I think).

@yuvipanda
Member Author

@damianavila that's what I ended up doing https://github.com/yuvipanda/tljh-conda-store/blob/0ea1a4ad6018447f995deb02b571f740dffaee40/tljh_conda_store/__init__.py#L40
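
A minimal sketch of the approach discussed above (a dedicated conda env for conda-store next to the hub's virtualenv). This is not the code from the linked plugin: the function names, the `/opt/tljh/user` prefix, the `conda-store` env name, and the `conda-store-server` package name are all assumptions for illustration. It only builds the `mamba create` command by default and runs it when `dry_run=False`.

```python
import subprocess
from pathlib import Path
from typing import List


def conda_store_env_command(
    conda_prefix: Path,
    env_name: str = "conda-store",        # hypothetical env name
    package: str = "conda-store-server",  # assumed package name on conda-forge
) -> List[str]:
    """Build a `mamba create` command for a separate conda-store environment."""
    mamba = conda_prefix / "bin" / "mamba"
    env_prefix = conda_prefix / "envs" / env_name
    return [
        str(mamba), "create", "--yes",
        "--prefix", str(env_prefix),
        "--channel", "conda-forge",
        "python", package,
    ]


def create_conda_store_env(conda_prefix: Path, dry_run: bool = True) -> List[str]:
    """Return the command, and execute it only when dry_run is False."""
    command = conda_store_env_command(conda_prefix)
    if not dry_run:
        subprocess.check_call(command)
    return command


if __name__ == "__main__":
    # Assumed location of the conda installation set up during bootstrapping.
    print(" ".join(create_conda_store_env(Path("/opt/tljh/user"))))
```

Keeping conda-store in its own prefix means the hub's virtualenv never needs conda installed into it, which sidesteps the setup.py problem mentioned earlier.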

@damianavila
Contributor

damianavila commented Nov 3, 2021

So, mamba is there, nice!
Btw, the linked code makes sense to me.

@choldgraf choldgraf changed the title from "Investigate conda-store by writing a TLJH conda-store plugin" to "Investigate conda-store to understand if / how it could be used in JupyterHub" on Nov 10, 2021
@yuvipanda
Member Author

Deeply tied into conda

As is clear from the name, conda-store is deeply tied into conda - it's
even part of all the database schemas. This is its core value proposition -
being tied into conda means it can provide perfectly reproducible environments
by taking advantage of how conda works (deeply inspired by the Nix ecosystem, of course).
However, this is also a disadvantage - as far as I can tell, you can't really
step outside of the conda ecosystem. Based on my experience helping set up
environments in educational and some research spaces, this has two main issues:

  • It leaves R users out in the cold a little. I know there's R support in conda-forge,
    but R users generally prefer installing things from CRAN.
  • Any other customizations (such as postBuild) don't seem possible here.

But maybe I'm looking at this the wrong way entirely, and an analogy to repo2docker (which
builds docker images, rather than environments) isn't quite right. Still, given that
is my baseline, I think the lack of support for things that aren't just in conda
is a serious limitation. This isn't a dig at the folks doing wonderful work in
the R ecosystem for conda, but a reflection of current observed preferences of R users.

Users can make as many conda envs as they want!

conda-store ships with its own concepts of namespaces (users), and environments
stored in a db. So individual users can create environments, and provide appropriate
permissions to other users on the hub to use them. This is very helpful when you have
one big hub that is used by a lot of fairly advanced users doing their own thing - as
you might in an enterprise organization. This is my favorite feature, but it also scares
me - it's versioning and storing environment definitions in a database, where I'd prefer
to keep that in something like git. Either way, this is the exciting part of conda-store,
and something I hope can be replicated in other tools that are more generic.

Many moving pieces

I tried to make the simplest possible setup for TLJH, but for production use outside
TLJH, I think the following separate processes will need to run:

  1. conda-store API server
  2. a postgresql database
  3. conda-store workers (celery) for actually building the environments
  4. a message queue (like redis) for communication between the API server and the celery workers
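
How these pieces fit together can be sketched with the standard library only. This is a toy model, not conda-store's code: `BuildStore` stands in for the postgresql database, `queue.Queue` for the message queue, and a thread for a celery worker. The class, function, and status names are all hypothetical.

```python
import queue
import threading


class BuildStore:
    """Stand-in for the database: maps build id -> status."""

    def __init__(self):
        self._lock = threading.Lock()
        self._status = {}
        self._next_id = 1

    def create(self):
        with self._lock:
            build_id = self._next_id
            self._next_id += 1
            self._status[build_id] = "QUEUED"
            return build_id

    def set_status(self, build_id, status):
        with self._lock:
            self._status[build_id] = status

    def status(self, build_id):
        with self._lock:
            return self._status[build_id]


def api_submit(store, tasks, spec):
    """What the API server does: record the build, then enqueue it."""
    build_id = store.create()
    tasks.put((build_id, spec))
    return build_id


def worker(store, tasks):
    """What a worker does: take builds off the queue and update their status."""
    while True:
        item = tasks.get()
        if item is None:  # sentinel: shut down
            break
        build_id, spec = item
        store.set_status(build_id, "BUILDING")
        # A real worker would solve and install the conda environment here.
        store.set_status(build_id, "COMPLETED")


def demo():
    store, tasks = BuildStore(), queue.Queue()
    thread = threading.Thread(target=worker, args=(store, tasks))
    thread.start()
    ids = [api_submit(store, tasks, {"dependencies": [pkg]}) for pkg in ("numpy", "pandas")]
    tasks.put(None)
    thread.join()
    return [store.status(build_id) for build_id in ids]


if __name__ == "__main__":
    print(demo())
```

Even in this toy form, three pieces of shared state (the store, the queue, and the worker lifecycle) have to be kept healthy together, which is the operational cost being weighed here.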

I've run celery based setups with message queues in the past, and they do work great.
But I try to avoid running them wherever possible :D In binderhub, we simply use kubernetes
directly with an idempotency property rather than celery for similar effect. I am always
just a little bit worried about the extra complexity a messaging system brings...
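
A rough sketch of what an idempotency-based approach can look like, as an illustration of the general idea rather than BinderHub's actual code. The build name is derived deterministically from the spec, so submitting the same spec twice targets the same object, and the second create is rejected instead of queued. `FakeCluster`, `AlreadyExists`, `build_name`, and `ensure_build` are hypothetical names; `FakeCluster` stands in for an API that refuses to create a second object with the same name, as Kubernetes does within a namespace.

```python
import hashlib
import json
from typing import Tuple


class AlreadyExists(Exception):
    """Raised by the stand-in cluster when a job name is already taken."""


class FakeCluster:
    """In-memory stand-in for a cluster API with unique object names."""

    def __init__(self):
        self.jobs = {}

    def create_job(self, name: str, spec: dict) -> None:
        if name in self.jobs:
            raise AlreadyExists(name)
        self.jobs[name] = spec


def build_name(spec: dict) -> str:
    """Derive a deterministic name from the spec content."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"build-{digest[:16]}"


def ensure_build(cluster: FakeCluster, spec: dict) -> Tuple[str, bool]:
    """Create the build if missing; return (name, created)."""
    name = build_name(spec)
    try:
        cluster.create_job(name, spec)
        return name, True
    except AlreadyExists:
        # Someone already asked for this exact build; reuse it.
        return name, False


if __name__ == "__main__":
    cluster = FakeCluster()
    print(ensure_build(cluster, {"channels": ["conda-forge"], "dependencies": ["numpy"]}))
    print(ensure_build(cluster, {"dependencies": ["numpy"], "channels": ["conda-forge"]}))
```

Because retries and duplicate requests collapse onto the same name, no separate queue is needed to deduplicate work; the cluster API itself acts as the source of truth.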

Next steps

I think the next step is to find a community that has pre-existing users who are
interested in creating a lot of varied custom conda environments to share
amongst themselves, but don't use R, and then possibly try rolling this out. I
don't think any of our current communities fit the bill. So I think from the 2i2c
side, now we just wait until someone else asks for this. Our current efforts
are probably better put towards moving more folks into https://github.com/jupyterhub/repo2docker-action
and perhaps https://github.com/yuvipanda/jupyterhub-configurator.

I'd also like to actually finish tljh-conda-store, so I can form more opinions from
actually trying it out and using it.

@choldgraf
Member

Thanks for this update @yuvipanda - it sounds like next steps with conda-store need to wait for finding a community that has the right use-case for it. I'll close this one and we can open a new one to track new implementation when the time is right. If you'd rather keep this one open feel free to re-open!

@damianavila
Contributor

@yuvipanda, first of all, thanks for the summary! I think the next step section is a reasonable one.

Our current efforts
are probably better put towards moving more folks into https://github.com/jupyterhub/repo2docker-action
and perhaps https://github.com/yuvipanda/jupyterhub-configurator.

Makes sense to me.
Although I would love to read more about your experiences with tljh-conda-store in the future!

In binderhub, we simply use kubernetes
directly with an idempotency property rather than celery for similar effect.

Btw, do you have a link that I can take a look at about this piece? 😉


5 participants