Debrief from PaleoHackWeek hub #260

Closed
choldgraf opened this issue Feb 21, 2021 · 8 comments

choldgraf commented Feb 21, 2021

Summary

We ran a hub for the PaleoHackWeek hackathon, in partnership with @CommonClimate and @khider. The event went great overall, but there are some things that we can learn from it!

cc @yuvipanda and @GeorgianaElena - perhaps we can discuss the summaries below and then decide how we'd like to spin off issues for these? Please provide any extra edits that clarify things below! And @CommonClimate + @khider, please feel free to provide any clarifications of your own, or new ideas as they come up!

What went well

  1. Paleohack organizers were able to control which users could log onto the system, without needing involvement from 2i2c staff.
  2. @khider was already familiar with tools like xarray, netcdf4, docker, etc., so she was able to help debug some issues that surfaced once the notebooks actually started running.
  3. For the most part, everything went really smoothly!
  4. We were running on a brand new Google Cloud project, and we were able to set up the cluster with minimal manual work on our part. There were also no cloud quota issues, which was great.

What went poorly

  1. We did a deploy during the hackathon to increase user resources (Increase RAM for paleohack2021 #256). However, because of a misunderstanding about how much memory was actually available on each node, new user pods stopped spawning - we were requesting more memory than the nodes could allocate! (See the sketch after this list for a quick way to check this before deploying.)
  2. A deploy from a laptop to quickly fix the above issue with another PR (paleohack2021: Change limits #257) triggered a proxy pod restart, since we use a different PROXY_SECRET_TOKEN on local machines (see Document PROXY_SECRET_KEY #116). Merging the PR then triggered another proxy pod restart, so the hub essentially stopped working while the proxy restarted and was repopulated with the appropriate routes.
  3. We had set a default maximum of 10 user nodes for autoscaling. This limit was hit pretty early on in the hackathon, since there were more users than would fit on 10 nodes, and it had to be bumped manually.
  4. The home page for the hackathon hub wasn't as useful as it could have been: there were no references to nbgitpuller, where to go for support, etc. It was just our generic hub home page template, which is more focused on educational hubs.
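
For item 1, here is a sketch of a pre-deploy check we could have run, assuming the kubernetes Python client is installed and kubectl is pointed at the cluster. The 2Gi per-user request is an illustrative value, not the actual paleohack setting:

```python
# Hypothetical pre-deploy check: how many user pods of a given memory request
# fit on each node's *allocatable* memory (which is lower than the machine's
# advertised RAM once kubelet and system reservations are subtracted).
from kubernetes import client, config

config.load_kube_config()  # uses the current kubectl context
v1 = client.CoreV1Api()

USER_MEMORY_REQUEST_GI = 2  # illustrative value

def to_gib(quantity: str) -> float:
    """Convert a Kubernetes memory quantity like '28348552Ki' to GiB."""
    suffixes = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}
    for suffix, factor in suffixes.items():
        if quantity.endswith(suffix):
            return float(quantity[:-2]) * factor / 1024**3
    return float(quantity) / 1024**3  # plain bytes

for node in v1.list_node().items:
    allocatable_gib = to_gib(node.status.allocatable["memory"])
    fits = int(allocatable_gib // USER_MEMORY_REQUEST_GI)
    print(f"{node.metadata.name}: {allocatable_gib:.1f}Gi allocatable, "
          f"fits {fits} user pods at {USER_MEMORY_REQUEST_GI}Gi each")
```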

Speed bumps

  • The relationship between the user environment and the content in it was not clear. mybinder.org is the common reference point for most people, and it doesn't really separate the two. We were building the image off this repo, and the content was in that repo as well. Updating content didn't require any work from 2i2c staff, but updating the environment did. Updating content required re-clicking the nbgitpuller link, while updating the user environment required users to stop and start their server. This was very confusing!
  • The amount of resources (RAM, CPU) required was hard to estimate, since people don't usually think about this when running on their laptop. There isn't a very clear way to do it, so it was trial and error (see the measurement sketch after this list).
  • Our terraform code to deploy the cluster does not allow creating a new node pool without deleting the current user node pool, and deleting the current user node pool would disrupt currently running users. We hacked around this by temporarily adding another node pool to the terraform code by hand.
  • We didn't have a smooth process for support from 2i2c staff during the hackathon; it ended up happening ad hoc via private messages on the 2i2c Slack. We should have a better process for this.
  • During first login, new users had to wait a few minutes while new nodes were spun up for them. This was made a bit better by enabling user placeholders to keep a two-node headroom, but something the hub admins could control themselves would have made this much easier.
  • We didn't have a super clear idea of how to get the cheapest 'base cost' - core node size and configuration. We ended up with e2-highcpu-4 and need to investigate whether that is the right choice.
  • We set up the NFS server manually. We shouldn't have had to.
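
On the resource-estimation speed bump above: one low-tech way to get a ballpark number is to run the notebook's heaviest cell locally and record peak memory. A rough sketch using only the standard library follows; note that ru_maxrss units differ between Linux and macOS, and the workload line is a placeholder:

```python
# Rough local measurement of peak memory for a notebook workload, to inform
# the per-user RAM request on the hub.
import resource
import time

def report_peak_memory(label: str) -> None:
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"{label}: peak RSS ~{peak_kb / 1024:.0f} MiB (assuming Linux units)")

start = time.perf_counter()
# ... run the heaviest part of the notebook here, e.g. opening the netCDF data ...
report_peak_memory("after heaviest cell")
print(f"wall time: {time.perf_counter() - start:.1f}s")
```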

Where we got lucky

  • With regular, non-HA 'zonal' clusters, modifying the cluster (including adding new node pools) makes the Kubernetes master unavailable for a minute or more. This isn't a problem with regional clusters. We had to change the user node size just before the hackathon started, and it worked out ok. Otherwise, new users would not have been able to log in for the duration of that operation - which is often less than a minute, but can non-deterministically be much longer.
  • Some users were getting a 'Server Error' when trying to view the contents of directories. This was intermittent - a refresh often fixed it. Nothing showed up in the logs, and we never found the root cause. Could have been a lot worse!
  • It was unclear what the right size for the user nodes should be. It's a trade-off between cost, the maximum possible resources for a single user (which is limited by node size), memory-to-CPU ratio, and autoscaling performance. We ended up using n1-highcpu-8 nodes, which were ok for the chosen per-user resource requests of 2 CPUs and 2GB RAM (see the back-of-the-envelope calculation after this list). But these requests had to be upped during the hackathon, and we got lucky that no user needed more than 4GB of RAM.
  • We had a Prometheus / Grafana setup, but we never actually used it. Grafana wasn't set up properly, since we don't have an easy way to do that, and Prometheus was also probably over-resourced for this setup.
  • We 'guessed' how much home directory storage was needed, and picked 100G of standard disk with a very small server. This might not have been enough, and could have been the cause of the 'Server Error' some folks experienced (judging from Stackdriver metrics). It could also be that we didn't get enough IOPS because our disk was too small. The disk was also ext4, and resizing seems much better supported with XFS. We got lucky that none of these became a real problem.
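
To make the node-size trade-off above concrete, here is the back-of-the-envelope packing calculation for n1-highcpu-8 (8 vCPUs, 7.2 GB RAM). The system overhead figure is an assumption, since the exact allocatable amount depends on GKE's kubelet and daemonset reservations:

```python
# Users-per-node estimate for n1-highcpu-8 with 2 CPU / 2GB per-user requests.
node_cpu, node_ram_gb = 8, 7.2      # n1-highcpu-8 machine type
system_overhead_gb = 0.8            # assumed kubelet + system daemon reservation
user_cpu_request, user_ram_request_gb = 2, 2

users_by_cpu = node_cpu // user_cpu_request
users_by_ram = int((node_ram_gb - system_overhead_gb) // user_ram_request_gb)
print(f"users per node: {min(users_by_cpu, users_by_ram)} "
      f"(CPU allows {users_by_cpu}, RAM allows {users_by_ram})")
```

With these numbers RAM, not CPU, is the binding constraint, which is why bumping per-user memory mid-event was the risky knob to turn.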

Action items

Process improvements

  1. Process for communication and escalation during a hackathon between 2i2c and organizers
  2. Process for home page customization so it is clear for first time users what they should be doing
  3. Recommend we use regional clusters for all hackathons, as any downtime is unacceptable
  4. Process for deciding on initial set of requirements - CPU, RAM, Disk Space - before the hackathon. We can be nimble about modifying them once users start using it.

Documentation improvements

  1. How nbgitpuller and image building work, so people understand what is happening when
  2. How to test the built image locally with appropriate resources for iteration
  3. How to measure memory & CPU usage locally, so the measurements can inform how many cloud resources we provide
  4. How to set up autobuild on quay.io for your image
  5. How manual deploys work, so others can do that in a pinch if necessary (Document manual deploys with deploy.py #113)
  6. Document trade-offs between user node size, resource requests, density and cost so we can pick what fits the particular project better.

Technical improvements

  1. Terraform should be more flexible with respect to user node pools, so we can add new ones without having to delete the existing one.
  2. Admins should be able to control user placeholder pods and memory & CPU usage, so they can modify these without needing 2i2c staff intervention
  3. We moved the paleohack entry in hubs.yaml (Move paleohack2021 to top of list #246) since full deploys take time. We should make these deploys much faster!
  4. Figure out the cheapest way to run our base infrastructure - see Minimize base cost of our clusters #235
  5. We should make sure that all our grafana deployments come with a proper setup of the dashboards that will make them useful. https://github.com/yuvipanda/jupyterhub-grafana is a start.
  6. Pull in PROXY_SECRET_KEY from a centralized location, so it's consistent wherever people run deploy scripts from (Document PROXY_SECRET_KEY #116); a sketch of one possible approach follows this list.
  7. Hackathon organizers should be able to control the hub home page better, so they can put content there that makes life easier for their particular set of attendees
  8. The NFS server should be automatically deployed and monitored (Run NFS servers in-cluster #50)
  9. Use XFS (or ZFS) for our home directory storage, so we can resize them with ease
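
For item 6, one possible shape - purely illustrative, with placeholder project and secret names, and Google Secret Manager is just one option (sops or Vault would work too) - is for the deploy script to read PROXY_SECRET_KEY from a central secret store instead of a per-machine environment variable:

```python
# Hypothetical: fetch PROXY_SECRET_KEY from Google Secret Manager at deploy
# time, so every machine running the deploy script gets the same value.
from google.cloud import secretmanager

def get_proxy_secret(project_id: str = "my-gcp-project",
                     secret_id: str = "proxy-secret-key") -> str:
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/latest"
    response = client.access_secret_version(name=name)
    return response.payload.data.decode("utf-8")
```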
@yuvipanda

I've made very extensive edits to the main issue. I'll take another pass tomorrow - I don't think it is complete yet. After that, we can spin this out into different issues in this and other repos.

@yuvipanda

OK, I think the main issue is complete as far as I'm concerned. I'd love to hear from @khider about the xarray issue they ran into as well, though :)


khider commented Feb 23, 2021

That one was on us. The notebook was written by exporting the values to numpy arrays instead of using the native xarray methods that make the process efficient.

The data is fairly large, resulting in a resource error.
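
For anyone hitting the same thing later, a minimal sketch of the difference; the file and variable names are made up, and this assumes dask is installed so xarray can chunk the data:

```python
import xarray as xr

# Lazily open a large netCDF file in chunks (requires dask).
ds = xr.open_dataset("large_paleoclimate_data.nc", chunks={"time": 120})

# Memory-hungry pattern: .values pulls the whole array into NumPy up front.
# mean_np = ds["temperature"].values.mean(axis=0)

# xarray-native pattern: the reduction is built lazily and computed chunk by chunk.
mean_xr = ds["temperature"].mean(dim="time").compute()
```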

@yuvipanda

Ah, that was helpful to know, @khider!

@CommonClimate

Yes, my fault for using xarray like a dummy. We live and learn!


yuvipanda commented Mar 1, 2021

Things left to do:

@choldgraf

This debrief is looking great, thanks @yuvipanda for fleshing it out. A few quick questions:

  • For the hub itself, I believe that they'd like to keep using it for the year, we just have not finalized the contract yet to be able to send to them, so let's keep it running for now.
  • Once the contract is ready, we should discuss how to handle the bill...it may not quite fit into the "monthly billing" cycle that we had discussed before...if there's not a way for them to back-pay us then we'll just have to eat the cost.
  • Are there any other issues that need to be created from the items in the debrief? Once we've got issues for everything I think we can close this. WDYT?

@yuvipanda changed the title from "Debrief - PaleoHackweek" to "Debrief from PaleoHackWeek hub" on Aug 9, 2021
@yuvipanda

This has been incorporated into more of our product processes by now.
