
Transition LEAP cloud infra to use shared nodes #2209

Closed
consideRatio opened this issue Feb 15, 2023 · 12 comments

@consideRatio
Member

consideRatio commented Feb 15, 2023

Together with @jbusecke, I've distilled a support ticket to [improve user startup times] and this request for information, among other things, into the technical work items below.

Node pool changes

  • There should only be one dask node pool, and it should use n2-highmem-16.
  • The user node pools small/medium/large/huge should be removed
  • A new user node pool (medium) should be added with an n2-highmem-16 machine (16 CPU, 128 GB memory)
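
The node pools themselves are managed declaratively in 2i2c's infrastructure repository, so the snippet below is only a hedged sketch of the intended end state, using the GKE Python client (google-cloud-container); the pool name, autoscaling bounds, labels, taints, and project/cluster path are placeholder assumptions. The single dask node pool would be analogous, also on n2-highmem-16.

```python
# Hedged sketch only: the real node pools are managed declaratively in the
# 2i2c infrastructure repo, not created imperatively like this. All names,
# bounds, labels, and taints below are placeholders.
from google.cloud import container_v1

client = container_v1.ClusterManagerClient()

user_pool = container_v1.NodePool(
    name="nb-medium",  # hypothetical name for the single new user node pool
    initial_node_count=0,
    autoscaling=container_v1.NodePoolAutoscaling(
        enabled=True, min_node_count=0, max_node_count=20
    ),
    config=container_v1.NodeConfig(
        machine_type="n2-highmem-16",  # 16 CPU, 128 GB memory
        labels={"hub.jupyter.org/node-purpose": "user"},
        taints=[
            container_v1.NodeTaint(
                key="hub.jupyter.org_dedicated",
                value="user",
                effect=container_v1.NodeTaint.Effect.NO_SCHEDULE,
            )
        ],
    ),
)

operation = client.create_node_pool(
    parent="projects/PROJECT_ID/locations/LOCATION/clusters/CLUSTER_NAME",
    node_pool=user_pool,
)
print(operation.status)
```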

New profile list

  • The small/medium/large/huge choices are to be replaced with two similar entries, with slight differences, that start servers on the 16-core machine.
    • These entries should present the image choices already available in the existing profile list
    • One version of the entry should be provided for the leap-pangeo-base-access GitHub team, and one for leap-pangeo-full-access. They differ only in the node share options offered: the base entry provides 1, 2, and 4 CPU, while the full entry provides 1, 2, 4, 8, and 16 CPU
    • The node share options should also set CPU/memory limits to match the requests. This should come with a comment noting that this is an exception rather than a default.
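
To make the spec above concrete, here is a minimal sketch of the two entries in KubeSpawner terms. This is not the actual 2i2c configuration (that lives in the deployment's YAML): the slugs, display names, image names, and the commented-out profile_list assignment are placeholder assumptions, and per-team filtering is handled by the hub configuration rather than shown here.

```python
# Minimal sketch of the intended profile entries, expressed as KubeSpawner
# profile_options / kubespawner_override dictionaries. Slugs, display names,
# and image tags are placeholders, not the exact values used.

def node_share_choices(cpu_options):
    """Node-share choices on an n2-highmem-16 node (16 CPU, 128 GB memory)."""
    mem_per_cpu_gb = 128 // 16  # 8 GB of memory per requested CPU
    return {
        f"cpu_{cpu}": {
            "display_name": f"{cpu} CPU, {cpu * mem_per_cpu_gb} GB RAM",
            "kubespawner_override": {
                # Per the agreement above, limits are set to match the
                # requests (noted as an exception rather than a default).
                "cpu_guarantee": cpu,
                "cpu_limit": cpu,
                "mem_guarantee": f"{cpu * mem_per_cpu_gb}G",
                "mem_limit": f"{cpu * mem_per_cpu_gb}G",
            },
        }
        for cpu in cpu_options
    }


# Image choices carried over from the existing profile list (tags are placeholders).
image_choices = {
    "pangeo": {
        "display_name": "Pangeo notebook",
        "kubespawner_override": {"image": "pangeo/pangeo-notebook:latest"},
    },
    "tensorflow": {
        "display_name": "Pangeo ML (tensorflow) notebook",
        "kubespawner_override": {"image": "pangeo/ml-notebook:latest"},
    },
}


def medium_profile(team_suffix, cpu_options):
    """One 'medium' profile entry; the two variants differ only in CPU choices."""
    return {
        "display_name": f"Medium: n2-highmem-16 ({team_suffix})",
        "slug": f"medium-{team_suffix}",
        "profile_options": {
            "requests": {
                "display_name": "Resource allocation",
                "choices": node_share_choices(cpu_options),
            },
            "image": {
                "display_name": "Image",
                "choices": image_choices,
            },
        },
    }


# leap-pangeo-base-access gets 1/2/4 CPU; leap-pangeo-full-access gets 1/2/4/8/16.
base_access_profile = medium_profile("base-access", [1, 2, 4])
full_access_profile = medium_profile("full-access", [1, 2, 4, 8, 16])

# c.KubeSpawner.profile_list = [base_access_profile, full_access_profile]
# (In the real deployment, which entry a user sees is tied to their GitHub
# team membership; that filtering is not shown in this sketch.)
```

The point of the sketch is only to show the shape of the two entries and how each node share choice pins both the request and the limit, per the agreement above.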

Follow-up actions

consideRatio self-assigned this Feb 15, 2023
@jbusecke
Contributor

Thanks for working on this. Please let me know before any changes are made, so I can make sure that:

  • The new user teams leap-pangeo-base-access and leap-pangeo-full-access are populated (to avoid interruption of service for users)
  • I give people a heads-up warning not to perform any work during the transition period

@jbusecke
Contributor

Quick update: I have created and populated the new teams in our org: https://github.com/orgs/leap-stc/teams/leap-pangeo-users/teams

These are non-overlapping at the moment.

@consideRatio
Member Author

@jbusecke I suggest that I work on this Monday morning until midday, Swedish time (UTC+1). Is that okay?

@consideRatio
Member Author

consideRatio commented Feb 27, 2023

There are currently three running users, so I won't cancel their sessions to start an upgrade. @jbusecke, should we aim for Monday morning next week, Swedish time (UTC+1), before you wake up in UTC-6?

@jbusecke
Contributor

jbusecke commented Mar 7, 2023

Sorry for the delay. I was really sick all last week and not able to stare at a screen.

I have notified the LEAP community to log out and stop any long-running computations before Sun Mar 12, and said that service should be back to normal on Monday after 8 AM. Is that correct, @consideRatio?

I will make another announcement closer to the weekend. Thank you for working on this!

@consideRatio
Member Author

@jbusecke absolutely correct!!

And you will have populated the GitHub teams leap-pangeo-base-access, leap-pangeo-full-access, etc. with the users by then, right?

@jbusecke
Contributor

jbusecke commented Mar 9, 2023

I still see some discrepancies between the old/new teams, but I will figure this out on my end.

@consideRatio
Member Author

@jbusecke I've now performed the most disruptive maintenance. As part of this, only users in either leap-pangeo-base-access or leap-pangeo-full-access have access now, as only these teams are provided with server startup choices.


In this issue, the part about the dask-gateway option remains, but the other parts are addressed.

As part of doing maintenance in #2237, I also noticed an optimization that had been made for a workshop a while back. After the workshop, when nodes were no longer started before users arrived, that optimization was likely slowing down startup of user servers unless they started the tensorflow image specifically. So I think in this setup now, there are three reasons for faster startup.

  1. Node sharing makes it likely that a node is already running.
  2. Already-running nodes likely already have the relevant image downloaded to them.
  3. When a node does need to be started, the user arriving on it won't risk having to wait for an unrelated image to be pulled.

If you wish for even faster startups and better performance, I suggest revisiting the CPU limitation agreed on for this maintenance. I've thought about this tricky topic a lot and summarized some of those thoughts in #2228 and in this FIXME note.

@jbusecke
Contributor

So far this is working great. Many thanks @consideRatio. I think the startup times as I (and others) experience them are absolutely sufficient at the moment.

> In this issue, the part about the dask-gateway option remains, but the other parts are addressed.

Is there anything that is required from my side to move this change forward?

@consideRatio
Member Author

> Is there anything that is required from my side to move this change forward?

No, I don't think anything is needed!

@jbusecke do you feel strongly about the CPU limits? I want to make sure you have a good user experience, and I think limiting CPU can be a significant drawback, even for users who request a lot of CPU in order not to be limited: they become less likely to fit on an already running node and may then need to wait for server startup, image pulling, etc.

Overall, it seems like a lose/lose/lose to me at the moment in terms of UX, cost, and energy efficiency.

@jbusecke
Contributor

> @jbusecke do you feel strongly about the CPU limits? I want to make sure you have a good user experience, and I think limiting CPU can be a significant drawback, even for users who request a lot of CPU in order not to be limited: they become less likely to fit on an already running node and may then need to wait for server startup, image pulling, etc.
>
> Overall, it seems like a lose/lose/lose to me at the moment in terms of UX, cost, and energy efficiency.

I think I would like to give people some time to give feedback on their experience before changing things more, but I am certainly open to iterating on this further.

From my perspective, the dask-gateway refactor is of higher priority. Should we perhaps track that in a different issue, to keep things neat and enable an ongoing discussion here about the CPU limits?

@consideRatio
Member Author

consideRatio commented Mar 16, 2023

> I think I would like to give people some time to give feedback on their experience before changing things more, but I am certainly open to iterating on this further.

Okay! Let us know if you want to remove the CPU limit or, for example, set it to 4x the requested CPU or similar.
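
Purely as an illustration of that option (not something configured today), each node-share override could set the limit to 4x the request along these lines; the function name and the 16-CPU cap are assumptions mirroring the n2-highmem-16 machine size.

```python
# Hypothetical node-share override where the CPU limit is 4x the request
# instead of equal to it; memory keeps request == limit.
def node_share_override(cpu_request, mem_gb, cpu_burst_factor=4):
    return {
        "cpu_guarantee": cpu_request,
        # Allow bursting up to 4x the requested CPU, capped at the 16 CPU node size.
        "cpu_limit": min(cpu_request * cpu_burst_factor, 16),
        "mem_guarantee": f"{mem_gb}G",
        "mem_limit": f"{mem_gb}G",
    }

# For example, a 2 CPU / 16 GB choice could then burst up to 8 CPU:
print(node_share_override(2, 16))
```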

> From my perspective, the dask-gateway refactor is of higher priority. Should we perhaps track that in a different issue, to keep things neat and enable an ongoing discussion here about the CPU limits?

I'm closing this issue now; the dask-gateway work is now represented by #2364 and #2051. I'm not able to prioritize that with my own time at the moment =/
