Debrief from PaleoHackWeek hub #260

Closed
choldgraf opened this issue Feb 21, 2021 · 8 comments

choldgraf commented Feb 21, 2021

Summary

We ran a hub for the PaleoHackWeek hackathon, in partnership with @CommonClimate and @khider. The event went great overall, but there are some things that we can learn from it!

cc @yuvipanda and @GeorgianaElena - perhaps we can discuss the summaries below and then decide how we'd like to spin off issues for these? Please provide any extra edits that clarify things below! And @CommonClimate + @khider, please feel free to provide any clarifications of your own, or new ideas as they come up!

What went well

  1. Paleohack organizers were able to control which users could log onto the system, without needing involvement from 2i2c staff.
  2. @khider was already familiar with tools like xarray, netcdf4, docker, etc., so she was able to help debug some issues that surfaced once the notebooks actually started running.
  3. For the most part, everything went really smoothly!
  4. We were running on a brand new Google Cloud project, and we were able to set up the cluster with minimal manual work on our part. There were also no cloud quota issues, which was great.

What went poorly

  1. We did a deploy during the hackathon to increase user resources (Increase RAM for paleohack2021 #256). However, because of a misunderstanding about how much memory was actually available on each node, new user pods stopped spawning - we were requesting more memory than the nodes could allocate! (See the sketch after this list for a quick way to check this before deploying.)
  2. A deploy from a laptop to quickly fix the above issue with another PR (paleohack2021: Change limits #257) triggered a proxy pod restart, since we use a different PROXY_SECRET_TOKEN on local machines (see Document PROXY_SECRET_KEY #116). Merging the PR then triggered another proxy pod restart, so the hub essentially stopped working while the proxy restarted and was repopulated with the appropriate routes.
  3. We had set a default maximum of 10 user nodes for autoscaling. This limit was hit pretty early on in the hackathon, since there were more users than would fit on 10 nodes, and it had to be bumped manually.
  4. The home page for the hackathon hub wasn't as useful as it could have been: there were no references to nbgitpuller, where to go for support, etc. It was just our generic hub home page template, which is more focused on educational hubs.
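
For item 1, here is a sketch of a pre-deploy check we could have run, assuming the kubernetes Python client is installed and kubectl is pointed at the cluster. The 2Gi per-user request is an illustrative value, not the actual paleohack setting:

```python
# Hypothetical pre-deploy check: how many user pods of a given memory request
# fit on each node's *allocatable* memory (which is lower than the machine's
# advertised RAM once kubelet and system reservations are subtracted).
from kubernetes import client, config

config.load_kube_config()  # uses the current kubectl context
v1 = client.CoreV1Api()

USER_MEMORY_REQUEST_GI = 2  # illustrative value

def to_gib(quantity: str) -> float:
    """Convert a Kubernetes memory quantity like '28348552Ki' to GiB."""
    suffixes = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}
    for suffix, factor in suffixes.items():
        if quantity.endswith(suffix):
            return float(quantity[:-2]) * factor / 1024**3
    return float(quantity) / 1024**3  # plain bytes

for node in v1.list_node().items:
    allocatable_gib = to_gib(node.status.allocatable["memory"])
    fits = int(allocatable_gib // USER_MEMORY_REQUEST_GI)
    print(f"{node.metadata.name}: {allocatable_gib:.1f}Gi allocatable, "
          f"fits {fits} user pods at {USER_MEMORY_REQUEST_GI}Gi each")
```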

Speed bumps

  • The relationship between the user environment and the content in it was not clear. mybinder.org is the common reference point for most people, and it doesn't really separate the two. We were building the image off this repo, and the content was in that repo as well. Updating content didn't require any work from 2i2c staff, but updating the environment did. Updating content required re-clicking the nbgitpuller link, while updating the user environment required users to stop and start their server. This was very confusing!
  • The amount of resources (RAM, CPU) required was hard to estimate, since people don't usually think about this when running on their laptop. There isn't a very clear way to do it, so it was trial and error (see the measurement sketch after this list).
  • Our terraform code to deploy the cluster does not allow creating a new node pool without deleting the current user node pool, and deleting the current user node pool would disrupt currently running users. We hacked around this by temporarily adding another node pool to the terraform code by hand.
  • We didn't have a smooth process for support from 2i2c staff during the hackathon; it ended up happening ad hoc via private messages on the 2i2c Slack. We should have a better process for this.
  • During first login, new users had to wait a few minutes while new nodes were spun up for them. This was made a bit better by enabling user placeholders to keep a two-node headroom, but something the hub admins could control themselves would have made this much easier.
  • We didn't have a super clear idea of how to get the cheapest 'base cost' - core node size and configuration. We ended up with e2-highcpu-4 and need to investigate whether that is the right choice.
  • We set up the NFS server manually. We shouldn't have had to.
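
On the resource-estimation speed bump above: one low-tech way to get a ballpark number is to run the notebook's heaviest cell locally and record peak memory. A rough sketch using only the standard library follows; note that ru_maxrss units differ between Linux and macOS, and the workload line is a placeholder:

```python
# Rough local measurement of peak memory for a notebook workload, to inform
# the per-user RAM request on the hub.
import resource
import time

def report_peak_memory(label: str) -> None:
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"{label}: peak RSS ~{peak_kb / 1024:.0f} MiB (assuming Linux units)")

start = time.perf_counter()
# ... run the heaviest part of the notebook here, e.g. opening the netCDF data ...
report_peak_memory("after heaviest cell")
print(f"wall time: {time.perf_counter() - start:.1f}s")
```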

Where we got lucky

  • With regular, non-HA 'zonal' clusters, modifying the cluster (including adding new node pools) makes the Kubernetes master unavailable for a minute or more. This isn't a problem with regional clusters. We had to change the user node size just before the hackathon started, and it worked out ok. Otherwise, new users would not have been able to log in for the duration of that operation - which is often less than a minute, but can non-deterministically be much longer.
  • Some users were getting a 'Server Error' when trying to view the contents of directories. This was intermittent - a refresh often fixed it. Nothing showed up in the logs, and we never found the root cause. Could have been a lot worse!
  • It was unclear what the right size for the user nodes should be. It's a trade-off between cost, the maximum possible resources for a single user (which is limited by node size), memory-to-CPU ratio, and autoscaling performance. We ended up using n1-highcpu-8 nodes, which were ok for the chosen per-user resource requests of 2 CPUs and 2GB RAM (see the back-of-the-envelope calculation after this list). But these requests had to be upped during the hackathon, and we got lucky that no user needed more than 4GB of RAM.
  • We had a Prometheus / Grafana setup, but we never actually used it. Grafana wasn't set up properly, since we don't have an easy way to do that, and Prometheus was also probably over-resourced for this setup.
  • We 'guessed' how much home directory storage was needed, and picked 100G of standard disk with a very small server. This might not have been enough, and could have been the cause of the 'Server Error' some folks experienced (judging from Stackdriver metrics). It could also be that we didn't get enough IOPS because our disk was too small. The disk was also ext4, and resizing seems much better supported with XFS. We got lucky that none of these became a real problem.
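
To make the node-size trade-off above concrete, here is the back-of-the-envelope packing calculation for n1-highcpu-8 (8 vCPUs, 7.2 GB RAM). The system overhead figure is an assumption, since the exact allocatable amount depends on GKE's kubelet and daemonset reservations:

```python
# Users-per-node estimate for n1-highcpu-8 with 2 CPU / 2GB per-user requests.
node_cpu, node_ram_gb = 8, 7.2      # n1-highcpu-8 machine type
system_overhead_gb = 0.8            # assumed kubelet + system daemon reservation
user_cpu_request, user_ram_request_gb = 2, 2

users_by_cpu = node_cpu // user_cpu_request
users_by_ram = int((node_ram_gb - system_overhead_gb) // user_ram_request_gb)
print(f"users per node: {min(users_by_cpu, users_by_ram)} "
      f"(CPU allows {users_by_cpu}, RAM allows {users_by_ram})")
```

With these numbers RAM, not CPU, is the binding constraint, which is why bumping per-user memory mid-event was the risky knob to turn.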

Action items

Process improvements

  1. Process for communication and escalation during a hackathon between 2i2c and organizers
  2. Process for home page customization so it is clear for first time users what they should be doing
  3. Recommend we use regional clusters for all hackathons, as any downtime is unacceptable
  4. Process for deciding on initial set of requirements - CPU, RAM, Disk Space - before the hackathon. We can be nimble about modifying them once users start using it.

Documentation improvements

  1. How nbgitpuller and image building work, so people understand what is happening when
  2. How to test the built image locally with appropriate resources for iteration
  3. How to measure memory & CPU usage locally, so the measurements can inform how many cloud resources we provide
  4. How to set up autobuild on quay.io for your image
  5. How manual deploys work, so others can do that in a pinch if necessary (Document manual deploys with deploy.py #113)
  6. Document trade-offs between user node size, resource requests, density and cost so we can pick what fits the particular project better.

Technical improvements

  1. Terraform should be more flexible with respect to user node pools, so we can add new ones without having to delete the existing one.
  2. Admins should be able to control user placeholder pods and memory & CPU usage, so they can modify these without needing 2i2c staff intervention
  3. We moved the paleohack entry in hubs.yaml (Move paleohack2021 to top of list #246) since full deploys take time. We should make these deploys much faster!
  4. Figure out the cheapest way to run our base infrastructure - see Minimize base cost of our clusters #235
  5. We should make sure that all our grafana deployments come with a proper setup of the dashboards that will make them useful. https://github.com/yuvipanda/jupyterhub-grafana is a start.
  6. Pull in PROXY_SECRET_KEY from a centralized location, so it's consistent wherever people run deploy scripts from (Document PROXY_SECRET_KEY #116); a sketch of one possible approach follows this list.
  7. Hackathon organizers should be able to control the hub home page better, so they can put content there that makes life easier for their particular set of attendees
  8. The NFS server should be automatically deployed and monitored (Run NFS servers in-cluster #50)
  9. Use XFS (or ZFS) for our home directory storage, so we can resize them with ease
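
For item 6, one possible shape - purely illustrative, with placeholder project and secret names, and Google Secret Manager is just one option (sops or Vault would work too) - is for the deploy script to read PROXY_SECRET_KEY from a central secret store instead of a per-machine environment variable:

```python
# Hypothetical: fetch PROXY_SECRET_KEY from Google Secret Manager at deploy
# time, so every machine running the deploy script gets the same value.
from google.cloud import secretmanager

def get_proxy_secret(project_id: str = "my-gcp-project",
                     secret_id: str = "proxy-secret-key") -> str:
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/latest"
    response = client.access_secret_version(name=name)
    return response.payload.data.decode("utf-8")
```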
@yuvipanda

I've made very extensive edits to the main issue. I'll take another pass tomorrow - I don't think it is complete yet. After that, we can spin this out into different issues in this and other repos.

@yuvipanda

OK, I think the main issue is complete as far as I'm concerned. I'd love to hear from @khider about the xarray issue they ran into as well, though :)


khider commented Feb 23, 2021

That one was on us. The notebook was written by exporting the values to numpy arrays instead of using the native xarray methods that make the process efficient.

The data is fairly large, resulting in a resource error.
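
For anyone hitting the same thing later, a minimal sketch of the difference; the file and variable names are made up, and this assumes dask is installed so xarray can chunk the data:

```python
import xarray as xr

# Lazily open a large netCDF file in chunks (requires dask).
ds = xr.open_dataset("large_paleoclimate_data.nc", chunks={"time": 120})

# Memory-hungry pattern: .values pulls the whole array into NumPy up front.
# mean_np = ds["temperature"].values.mean(axis=0)

# xarray-native pattern: the reduction is built lazily and computed chunk by chunk.
mean_xr = ds["temperature"].mean(dim="time").compute()
```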

@yuvipanda

Ah, that was helpful to know, @khider!

@CommonClimate

Yes, my fault for using xarray like a dummy. We live and learn!


yuvipanda commented Mar 1, 2021

Things left to do:

@choldgraf

This debrief is looking great, thanks @yuvipanda for fleshing it out. A few quick questions:

  • For the hub itself, I believe that they'd like to keep using it for the year, we just have not finalized the contract yet to be able to send to them, so let's keep it running for now.
  • Once the contract is ready, we should discuss how to handle the bill...it may not quite fit into the "monthly billing" cycle that we had discussed before...if there's not a way for them to back-pay us then we'll just have to eat the cost.
  • Are there any other issues that need to be created from the items in the debrief? Once we've got issues for everything I think we can close this. WDYT?

@yuvipanda changed the title from "Debrief - PaleoHackweek" to "Debrief from PaleoHackWeek hub" on Aug 9, 2021
@yuvipanda

This has been incorporated into more of our product processes by now.
