
Documenting best practices for resource allocation when pre-warming a hub for an event #1594

Open
GeorgianaElena opened this issue Aug 4, 2022 · 6 comments
Labels
nominated-to-be-resolved-during-q4-2023 Nomination to be resolved during q4 goal of reducing the technical debt

Comments

@GeorgianaElena
Member

GeorgianaElena commented Aug 4, 2022

Update 2023

We are now using node sharing for a more effective resource allocation. See #2121 for more details.

With this resource allocation, we can empower admins to pre-warm the hubs by themselves before an event, by carefully choosing the machine types (depending on their expected usage and number of users).

Specifically in the context of an event, we should document current resource allocation practices:

  • engineer-facing, in https://infrastructure.2i2c.org/, to help engineers gather all the relevant information about the event from community reps and make any needed changes to the infrastructure before the event (when relevant).
  • community-facing, describing how pre-warming is possible by having an admin start a specific server that brings up an entire node before the event.
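For illustration, the community-facing pre-warm flow could be backed by a hub profile like the hypothetical one sketched below. The profile name and memory figure are invented for this sketch; `profileList` and `kubespawner_override` are standard zero-to-jupyterhub configuration keys.

```yaml
# Hypothetical zero-to-jupyterhub config fragment (values are made up).
# An admin picks this profile shortly before the event; its large memory
# guarantee cannot fit on an occupied node, so a fresh node is brought up
# and stays warm for the incoming user pods.
singleuser:
  profileList:
    - display_name: "Event pre-warm (admins only)"
      description: "Starts a server sized to claim an entire node"
      kubespawner_override:
        mem_guarantee: 48G  # roughly one node's allocatable memory (assumed)
```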

Context

I believe it would be super useful to define some 2i2c-specific best practices for when we're pre-warming the hubs for events and how we're supposed to be choosing:

  • machine type
  • autoscaler limits
  • singleuser server memory and CPU limits.

From what I'm seeing from last year's events, it looks like we used to follow an approach of <10 pods per node and a high autoscaler limit. But recently I've seen recommendations to use more powerful machines that fit more pods, with fewer nodes in the nodepool, so that CPU and memory are used more efficiently.
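The knobs being compared above are set in the hub's Helm values. A minimal sketch, with made-up numbers rather than a recommendation (the keys `singleuser.memory` and `singleuser.cpu` with `guarantee`/`limit` are standard zero-to-jupyterhub configuration):

```yaml
# Hypothetical values.yaml fragment. The memory guarantee is what the
# Kubernetes scheduler reserves per user pod, so it controls packing:
# on a node with ~13 GB allocatable (assumed), a 1.5 GB guarantee fits
# roughly 8 user pods per node.
singleuser:
  memory:
    guarantee: 1.5G  # reserved per user pod; drives pods-per-node
    limit: 2G        # hard cap; the pod is OOM-killed above this
  cpu:
    guarantee: 0.1   # small guarantee lets CPU be shared opportunistically
    limit: 1
```

Raising the guarantee gives each user more headroom but spreads them over more (or larger) nodes; lowering it packs users tighter at the risk of contention.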

Proposal

I know there are pros and cons to each of these approaches, but what are some key factors that might favor one approach over the other? More specifically, what questions could we ask communities about their workflow in order to make a more informed decision?

Updates and actions

@GeorgianaElena
Member Author

Also sharing @consideRatio's answer from Slack for more context:

" Decisions on what nodes to use could reasonably be delegated, but we can come up with recommendations as well.
I have no clear recipe on how I've made decisions on this historically. But I think:

  • putting 10-100 users on the same node is reasonable
  • having 0-10 or 1-10 nodes is reasonable

Goals:

  • Avoid long startup times
  • Improve CPU and memory efficiency by having multiple users on a node
  • Avoid nodes that are too large, as they can be oversized for the few users active during periods of low activity
  • More goals...

Overall, very complicated topic. One can optimize for so many things and depending on resource request and expected user activity, one may opt for very different things in hardware.
"
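As a rough back-of-the-envelope aid for the 10-100 users-per-node and 1-10 nodes heuristics above, the node count implied by a memory guarantee can be estimated as follows. This is a sketch, not from the thread; the helper name and all numbers are hypothetical, and it assumes memory (not CPU) is the binding constraint.

```python
import math

def nodes_needed(expected_users: int, mem_guarantee_gb: float,
                 node_allocatable_gb: float) -> int:
    """Estimate how many nodes a per-user memory guarantee implies.

    Assumes memory is the binding constraint; ignores CPU requests and
    any overhead not already excluded from 'allocatable'.
    """
    users_per_node = int(node_allocatable_gb // mem_guarantee_gb)
    if users_per_node < 1:
        raise ValueError("guarantee exceeds a whole node's allocatable memory")
    return math.ceil(expected_users / users_per_node)

# 120 expected users with a 1.5 GB guarantee on nodes with ~52 GB
# allocatable: 34 users per node -> 4 nodes, comfortably inside the
# 10-100 users-per-node and 1-10 nodes ranges suggested above.
print(nodes_needed(120, 1.5, 52))  # -> 4
```

Running the same estimate with the community's actual expected user count and resource requests is essentially the information this thread suggests collecting from community reps before an event.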

@GeorgianaElena GeorgianaElena changed the title Best practices for resource allocation when pre-warming a hub for an event Documenting best practices for resource allocation when pre-warming a hub for an event Aug 4, 2022
@consideRatio
Member

So I'm setting up a hub for an event in #2049, but I think the hub experience can be improved by using larger nodes that users share instead of allocating individual nodes for users.

I think we should provide recommended setups and clarify the benefits of sharing a few larger nodes: lower cost, better UX with regard to startup time, etc.

@consideRatio
Member

I opened #2121, which I think relates greatly to this.

@GeorgianaElena GeorgianaElena added the nominated-to-be-resolved-during-q4-2023 Nomination to be resolved during q4 goal of reducing the technical debt label Oct 19, 2023
@consideRatio
Member

consideRatio commented Oct 19, 2023

I think of this as, to a large extent, blocked by #3030, and that we also need a feature to not always optimize for the smallest available node for a resource allocation request. I opened #3293 to track that.

I think the information we need from community reps is the expected number of users and what resource allocation requests they plan to use - that would allow us to optimize for the event quite well.

@GeorgianaElena
Member Author

Thanks @consideRatio! ✨

I think of this as to a large extent blocked by #3030, and that we also need a feature to not always optimize to use the smallest available node for a resource allocation request. I opened #3293 to track that.

I believe hub resource-sharing options during everyday usage vs. during events tend to differ. Specifically for events, we should have a written policy that we apply, or a set of guidelines that we check, to decide if we need to make adjustments to the infrastructure.

So, I don't think we should block writing these docs/guidelines about events on creating a utility that would guide us in setting up the choices for everyday usage based on a chosen strategy, which is what I understand #3030 is trying to achieve.

Also, I believe that the more we try to make #3030 perfect and cover all cases and strategies, the harder it will be to move it forward, and until then we are stuck in a weird state where:

  • all the guidance an engineer has when setting up a new hub, or when making a hub ready for an event, is buried in issues, comments, and Slack messages. I believe this creates toil and stress.
  • we haven't updated our community-facing docs about "pre-warming" for events in https://docs.2i2c.org/community/events/#events-pre-initialized to explain how pre-warming can now be leveraged by the communities themselves, depending on their usage.

I think the information we need from community reps is the expected number of users and what resource allocation requests they plan to use - that would allow us to optimize for the event quite well.

Yes, I've noticed community reps actually using the template in https://docs.2i2c.org/community/events/#notify-the-2i2c-team-about-the-event, which I think is great, but it means we should make sure we keep that info updated (this is the place I'm thinking of when I say community-facing docs).

@GeorgianaElena
Member Author

Ah, I've just noticed #3293! That makes sense.
But I would love to have the words that guide that development written down before implementing it, or at least to not block writing them on implementing the utility.

Motivation would be:

  • without the words, the only people who could implement the utility efficiently would be @consideRatio and @yuvipanda, who've invested the most time and effort into improving this system.
  • without the words, the other engineers must dig into long discussions on issues, trying to understand what convention we are following for events, or just do nothing because it's not in the docs, in which case the responsibility for observing an issue with it also falls on the shoulders of @consideRatio and @yuvipanda.

I really believe that having such docs will reduce stress, fatigue, and load on the engineering team.
