
Documenting best practices for resource allocation when pre-warming a hub for an event #1594

Open
GeorgianaElena opened this issue Aug 4, 2022 · 6 comments
Labels
nominated-to-be-resolved-during-q4-2023 Nomination to be resolved during q4 goal of reducing the technical debt

Comments

@GeorgianaElena
Member

GeorgianaElena commented Aug 4, 2022

Update 2023

We are now using node sharing for a more effective resource allocation. See #2121 for more details.

With this resource allocation, we can empower admins to pre-warm the hubs by themselves before an event, by carefully choosing the machine types (depending on their expected usage and number of users).

Specifically in the context of an event, we should document current resource allocation practices:

  • engineer-facing, in https://infrastructure.2i2c.org/, to help engineers gather all the relevant information about the event from community reps and make any needed changes to the infrastructure before the event (when relevant).
  • community-facing, describing how pre-warming is possible by having an admin start a specific server that brings up an entire node before the event.
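For illustration, the community-facing pre-warm flow could be backed by a hub profile like the hypothetical one sketched below. The profile name and memory figure are invented for this sketch; `profileList` and `kubespawner_override` are standard zero-to-jupyterhub configuration keys.

```yaml
# Hypothetical zero-to-jupyterhub config fragment (values are made up).
# An admin picks this profile shortly before the event; its large memory
# guarantee cannot fit on an occupied node, so a fresh node is brought up
# and stays warm for the incoming user pods.
singleuser:
  profileList:
    - display_name: "Event pre-warm (admins only)"
      description: "Starts a server sized to claim an entire node"
      kubespawner_override:
        mem_guarantee: 48G  # roughly one node's allocatable memory (assumed)
```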

Context

I believe it would be super useful to define some 2i2c-specific best practices for when we're pre-warming the hubs for events and how we're supposed to be choosing:

  • machine type
  • autoscaler limits
  • singleuser server memory and CPU limits.

From what I'm seeing from last year's events, it looks like we used to follow an approach of <10 pods per node and a high autoscaler limit. But recently I've seen recommendations to use more powerful machines that fit more pods, with fewer nodes in the nodepool, so that CPU and memory are used more efficiently.
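The knobs being compared above are set in the hub's Helm values. A minimal sketch, with made-up numbers rather than a recommendation (the keys `singleuser.memory` and `singleuser.cpu` with `guarantee`/`limit` are standard zero-to-jupyterhub configuration):

```yaml
# Hypothetical values.yaml fragment. The memory guarantee is what the
# Kubernetes scheduler reserves per user pod, so it controls packing:
# on a node with ~13 GB allocatable (assumed), a 1.5 GB guarantee fits
# roughly 8 user pods per node.
singleuser:
  memory:
    guarantee: 1.5G  # reserved per user pod; drives pods-per-node
    limit: 2G        # hard cap; the pod is OOM-killed above this
  cpu:
    guarantee: 0.1   # small guarantee lets CPU be shared opportunistically
    limit: 1
```

Raising the guarantee gives each user more headroom but spreads them over more (or larger) nodes; lowering it packs users tighter at the risk of contention.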

Proposal

I know there are pros and cons to each of these approaches, but what are some key factors that might favor one approach over the other? More specifically, what questions could we ask communities about their workflow in order to make a more informed decision?

Updates and actions

@GeorgianaElena
Member Author

Also sharing @consideRatio's answer from Slack for more context:

" Decisions on what nodes to use could reasonably be delegated, but we can come up with recommendations as well.
I have no clear recipe on how I've made decisions on this historically. But I think:

  • putting 10-100 users on the same node is reasonable
  • having 0-10 or 1-10 nodes is reasonable

Goals:

  • Avoid long startup times
  • Improve CPU and memory efficiency by having multiple users on a node
  • Avoid nodes that are too large, as they can be oversized for the few users active during periods of low activity
  • More goals...

Overall, very complicated topic. One can optimize for so many things and depending on resource request and expected user activity, one may opt for very different things in hardware.
"
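As a rough back-of-the-envelope aid for the 10-100 users-per-node and 1-10 nodes heuristics above, the node count implied by a memory guarantee can be estimated as follows. This is a sketch, not from the thread; the helper name and all numbers are hypothetical, and it assumes memory (not CPU) is the binding constraint.

```python
import math

def nodes_needed(expected_users: int, mem_guarantee_gb: float,
                 node_allocatable_gb: float) -> int:
    """Estimate how many nodes a per-user memory guarantee implies.

    Assumes memory is the binding constraint; ignores CPU requests and
    any overhead not already excluded from 'allocatable'.
    """
    users_per_node = int(node_allocatable_gb // mem_guarantee_gb)
    if users_per_node < 1:
        raise ValueError("guarantee exceeds a whole node's allocatable memory")
    return math.ceil(expected_users / users_per_node)

# 120 expected users with a 1.5 GB guarantee on nodes with ~52 GB
# allocatable: 34 users per node -> 4 nodes, comfortably inside the
# 10-100 users-per-node and 1-10 nodes ranges suggested above.
print(nodes_needed(120, 1.5, 52))  # -> 4
```

Running the same estimate with the community's actual expected user count and resource requests is essentially the information this thread suggests collecting from community reps before an event.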

@GeorgianaElena GeorgianaElena changed the title Best practices for resource allocation when pre-warming a hub for an event Documenting best practices for resource allocation when pre-warming a hub for an event Aug 4, 2022
@consideRatio
Member

So I'm setting up a hub for an event in #2049, but I think the hub experience can be improved by using larger nodes that users share instead of allocating individual nodes for users.

I think we should provide recommended setups and clarify the benefits of sharing a few larger nodes: lower cost, better UX with regard to startup time, etc.

@consideRatio
Member

I opened #2121, which I think relates greatly to this.

@GeorgianaElena GeorgianaElena added the nominated-to-be-resolved-during-q4-2023 Nomination to be resolved during q4 goal of reducing the technical debt label Oct 19, 2023
@consideRatio
Member

consideRatio commented Oct 19, 2023

I think of this as, to a large extent, blocked by #3030, and that we also need a feature to not always optimize for the smallest available node for a resource allocation request. I opened #3293 to track that.

I think the information we need from community reps is the expected number of users and what resource allocation requests they plan to use - that would allow us to optimize for the event quite well.

@GeorgianaElena
Member Author

Thanks @consideRatio! ✨

I think of this as to a large extent blocked by #3030, and that we also need a feature to not always optimize to use the smallest available node for a resource allocation request. I opened #3293 to track that.

I believe hub resource-sharing options during everyday usage vs. during events tend to differ. Specifically for events, we should have a written policy that we apply, or a set of guidelines that we check, to decide if we need to make adjustments to the infrastructure.

So, I don't think we should block writing these docs/guidelines about events on creating a utility that would guide us in setting up the choices for everyday usage based on a chosen strategy, which is what I understand #3030 is trying to achieve.

Also, I believe that the more we try to make #3030 perfect and cover all cases and strategies, the harder it will be to move it forward, and until then we are stuck in a weird state where:

  • all the guidance an engineer has when setting up a new hub, or when making a hub ready for an event, is buried in issues, comments, and Slack messages. I believe this creates toil and stress.
  • we haven't updated our community-facing docs about "pre-warming" for events in https://docs.2i2c.org/community/events/#events-pre-initialized to explain how pre-warming can now be leveraged by the communities themselves, depending on their usage.

I think the information we need from community reps is the expected number of users and what resource allocation requests they plan to use - that would allow us to optimize for the event quite well.

Yes, I've noticed community reps actually using the template in https://docs.2i2c.org/community/events/#notify-the-2i2c-team-about-the-event, which I think is great, but it means we should make sure we keep that info updated (this is the place I'm thinking of when I say community-facing docs).

@GeorgianaElena
Member Author

Ah, I've just noticed #3293! That makes sense.
But I would love to have the words that guide that development written down before implementing it, or at least to not block writing them on implementing the utility.

Motivation would be:

  • without the words, the only people who could implement the utility efficiently would be @consideRatio and @yuvipanda, who've invested the most time and effort into improving this system.
  • without the words, the other engineers must dig into long discussions on issues, trying to understand what convention we are following for events, or just do nothing because it's not in the docs, in which case the responsibility for observing an issue with it also falls on the shoulders of @consideRatio and @yuvipanda.

I really believe that having such docs will reduce stress, fatigue, and load on the engineering team.
