Speed up node spin-up with placeholders #643

Closed
2 tasks
choldgraf opened this issue Aug 31, 2021 · 7 comments
Comments

@choldgraf
Member

choldgraf commented Aug 31, 2021

Description

We should speed up the time it takes to spin up our nodes by using placeholders.

Speeding up node spin-up would make our hubs perform better whenever there are spikes in activity, or whenever a user triggers a scale-up event. It would help our hubs feel speedier.

Guide for implementation

➡️ Great info about user placeholders is in the z2jh docs here and on Discourse
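For reference, the core of the z2jh mechanism is a couple of Helm chart values. Here is a minimal sketch, assuming recent z2jh chart defaults; the replica count is an arbitrary example and would need tuning per hub:

```yaml
# Sketch of z2jh Helm values enabling user placeholder pods.
# Pod priority must be enabled so real user pods can evict the
# low-priority placeholders, which then go Pending and trigger a
# node scale-up ahead of actual demand.
scheduling:
  podPriority:
    enabled: true
  userPlaceholder:
    enabled: true
    # Number of "warm" placeholder pods to keep around (example value)
    replicas: 4
```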

Performance

To understand whether this would make a big impact on performance, we could try analyzing the log files from the old U.Toronto hub and comparing them with the new one's.

Implementing across cloud providers

It's unclear whether this would behave the same way across the major cloud providers. Here's what we know about each:

GKE: According to this note in the z2jh docs, this should work on GKE at least.

Azure: The original UToronto cluster runs on Azure and had been using user placeholders ➡️ utoronto-2i2c/jupyterhub-deploy@1c7fa04 (that's no longer the case since we've migrated it to the pilot-hubs infra as part of #638)

AWS: Not sure if/how this works

Updates and tasks

  • Estimate the performance improvement we'd get with user placeholders
  • Decide if / how we should implement this across the cloud providers


@GeorgianaElena
Member

GeorgianaElena commented Jan 4, 2022

(Adding some additional info here, as I believe it doesn't need its own new issue. Sorry it's a bit unpolished.)

EDIT FROM CHRIS: added to the top comment

@choldgraf
Member Author

Thanks @GeorgianaElena for providing this helpful context! I've taken your comment and incorporated it into the top comment, so that we can keep all of the information in one place. I hope that's OK!

I seem to remember a conversation with @yuvipanda and @consideRatio where they said that the user placeholders were not working as well as they thought they were, but I don't know if that was unique to one deployment, or one cloud provider, etc.

@consideRatio
Member

Yuvi and I have deliberated a lot about this, and I'd like to avoid rehashing the technical details motivating these suggestions within this issue, but in summary:

  • If several user placeholder pods are configured, it is probably better to run a smaller number of placeholder pods, but give each of them resource requests matching the least powerful node available for users in the k8s cluster (see the sketch after this list).
  • If you simply want to always have at least one node running, configure the node pool with an autoscaling range of 1-X and don't use placeholder pods.
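To make the first suggestion concrete, here's a hedged sketch of what it could look like in z2jh Helm values. The resource numbers are purely illustrative, and it assumes a chart version that supports overriding scheduling.userPlaceholder.resources (otherwise placeholders inherit the singleuser requests):

```yaml
# Illustrative only: one placeholder sized to (roughly) fill the least
# powerful user node, so evicting it frees a whole node's worth of
# capacity and the re-pending placeholder triggers a full node scale-up.
scheduling:
  podPriority:
    enabled: true
  userPlaceholder:
    enabled: true
    replicas: 1          # roughly one placeholder per buffer node
    resources:
      requests:
        cpu: "3.5"       # example values for a hypothetical 4 CPU / 26 GB node
        memory: 24Gi
```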

@damianavila
Contributor

IIRC there were also discussions about node placeholders instead of user placeholders.

@GeorgianaElena
Member

Update:

UToronto folks opened a ticket about cluster scale-up duration, which I believe caused spawn timeouts for them: https://2i2c.freshdesk.com/a/tickets/79. According to the report, this wasn't an isolated event.

Since the UToronto hub is heavily used, I believe this will continue to happen, so we should prioritize this task.

@consideRatio
Member

I added a note about it. I think we have quite a distinct fingerprint of the issue:

  1. A user pod is assigned to a node, but the node is considered full ("too many pods").
  2. Because the pod was scheduled onto a node, the node pool is not autoscaled to add new nodes - the scheduler did place the pod on a node, it didn't leave it in a Pending state in need of a node!

I responded about this in the ticket.

@GeorgianaElena
Member

Thanks a lot @consideRatio! Since it's a different type of beast and not at all what I thought at first, I'll open a separate issue to discuss it further.
