Speed up node spin-up with placeholders #643

Closed
2 tasks
choldgraf opened this issue Aug 31, 2021 · 7 comments
Comments

@choldgraf
Member

choldgraf commented Aug 31, 2021

Description

We should speed up the time it takes to spin up our nodes by using placeholders.

Speeding up node spin-up would make our hubs perform better whenever there are spikes in activity, or whenever a user triggers a scale-up event. It would help our hubs feel speedier.

Guide for implementation

➡️ Great info about user placeholders is in the z2jh docs here and on Discourse
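For reference, the core of the z2jh mechanism is a couple of Helm chart values. Here is a minimal sketch, assuming recent z2jh chart defaults; the replica count is an arbitrary example and would need tuning per hub:

```yaml
# Sketch of z2jh Helm values enabling user placeholder pods.
# Pod priority must be enabled so real user pods can evict the
# low-priority placeholders, which then go Pending and trigger a
# node scale-up ahead of actual demand.
scheduling:
  podPriority:
    enabled: true
  userPlaceholder:
    enabled: true
    # Number of "warm" placeholder pods to keep around (example value)
    replicas: 4
```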

Performance

To understand whether this would make a big impact on performance, we could try analyzing the log files from the old U.Toronto hub and comparing them with the new one's.

Implementing across cloud providers

It's unclear whether this would behave the same way across the major cloud providers. Here's what we know about each:

GKE: According to this note in the z2jh docs, this should work on GKE at least.

Azure: The original UToronto cluster runs on Azure and had been using user placeholders ➡️ utoronto-2i2c/jupyterhub-deploy@1c7fa04 (that's no longer the case since we've migrated it to the pilot-hubs infra as part of #638)

AWS: Not sure if/how this works

Updates and tasks

  • Estimate the performance improvement we'd get with user placeholders
  • Decide if / how we should implement this across the cloud providers


@GeorgianaElena
Member

GeorgianaElena commented Jan 4, 2022

(Adding some additional info here, as I believe it doesn't need its own new issue. Sorry it's a bit unpolished.)

EDIT FROM CHRIS: added to the top comment

@choldgraf
Member Author

Thanks @GeorgianaElena for providing this helpful context! I've taken your comment and incorporated it into the top comment, so that we can keep all of the information in one place. I hope that's OK!

I seem to remember a conversation with @yuvipanda and @consideRatio where they said that the user placeholders were not working as well as they thought they were, but I don't know if that was unique to one deployment, or one cloud provider, etc.

@consideRatio
Member

Yuvi and I have deliberated a lot about this, and I'd like to avoid rehashing the technical details motivating these suggestions within this issue, but in summary:

  • If several user placeholder pods are configured, it is probably better to run a smaller number of placeholder pods, but give each of them resource requests matching the least powerful node available for users in the k8s cluster (see the sketch after this list).
  • If you simply want to always have at least one node running, configure the node pool with an autoscaling range of 1-X and don't use placeholder pods.
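To make the first suggestion concrete, here's a hedged sketch of what it could look like in z2jh Helm values. The resource numbers are purely illustrative, and it assumes a chart version that supports overriding scheduling.userPlaceholder.resources (otherwise placeholders inherit the singleuser requests):

```yaml
# Illustrative only: one placeholder sized to (roughly) fill the least
# powerful user node, so evicting it frees a whole node's worth of
# capacity and the re-pending placeholder triggers a full node scale-up.
scheduling:
  podPriority:
    enabled: true
  userPlaceholder:
    enabled: true
    replicas: 1          # roughly one placeholder per buffer node
    resources:
      requests:
        cpu: "3.5"       # example values for a hypothetical 4 CPU / 26 GB node
        memory: 24Gi
```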

@damianavila
Contributor

IIRC there were also discussions about node placeholders instead of user placeholders.

@GeorgianaElena
Member

Update:

UToronto folks opened a ticket about cluster scale-up duration, which I believe caused spawn timeouts for them: https://2i2c.freshdesk.com/a/tickets/79. According to the report, this wasn't an isolated event.

Since the UToronto hub is heavily used, I believe this will continue to happen, so we should prioritize this task.

@consideRatio
Member

I added a note about it. I think we have quite a distinct fingerprint of the issue:

  1. A user pod is assigned to a node, but the node is considered full ("too many pods").
  2. Because the pod was scheduled onto a node, the node pool is not autoscaled to add new nodes - the scheduler did place the pod on a node, it didn't leave it in a Pending state in need of a node!

I responded about this in the ticket.

@GeorgianaElena
Member

Thanks a lot @consideRatio! Since it's a different type of beast and not at all what I thought at first, I'll open a separate issue to discuss it further.
