Configure some CPU/memory requests for hub and proxy pods in basehub #2127

Open
consideRatio opened this issue Feb 1, 2023 · 3 comments
consideRatio commented Feb 1, 2023

Currently, the hub and proxy pods request very little CPU/memory, while various other pods already have 100m in requests by default. This could starve our hub/proxy pods of CPU. For the sake of stability, I think we should grant the hub pod 1 full CPU, and let both pods request enough memory that we're confident they won't get evicted or OOMKilled either.

It also seems that the hub/proxy pods' current memory requests (128Mi and 64Mi) aren't covering the need. It would be good to request more memory than we typically use so that we don't risk being evicted or OOMKilled.

Current requests/limits for the prod hub and proxy pods:

  prod                        proxy-7f5dbcbd68-zqd22                                 10m (0%)      0 (0%)      64Mi (0%)        1Gi (4%)       94d
  prod                        hub-5db4d78fdc-bwbvv                                   10m (0%)      0 (0%)       128Mi (0%)       2Gi (9%)       60m

This could have been relevant for the incident in #2126; even if it wasn't, it would be good to rule it out by increasing these requests.

Config in basehub

If we provide a 10m request while other pods on the node have 100m requests and are going full throttle, they will get a ten times larger share of CPU than the hub pod. On core nodes with 4 CPUs, that means our hub pod would only get 0.4 CPU.

My understanding is that the hub pod can benefit from up to 1 full CPU from time to time, but I'm a bit unsure about that. I recall that a Grafana dashboard I've seen in the past presented metrics in a way that fails to capture the peaks properly unless zoomed in.

@yuvipanda I think we could put 50m or 100m in the requests here for the hub pod, to reduce the risk of it getting throttled before reaching 1 CPU when competing with other pods. What do you think? A sketch of what that could look like follows after the current config below.

Hub pod

resources:
  requests:
    # Very small unit, since we don't want any CPU guarantees
    cpu: 0.01
    memory: 128Mi
  limits:
    memory: 2Gi

Proxy pod

resources:
  requests:
    # FIXME: We want no guarantees here!!!
    # This is lowest possible value
    cpu: 0.01
    memory: 64Mi
  limits:
    memory: 1Gi

@yuvipanda
Member

I think looking at observed usage metrics and setting appropriate requests and limits is a good idea! We don't want them to be too high (especially in shared clusters), as that might increase overall cost, but we already have data for this, so I'll leave it to you to figure out decent numbers and get them in. I agree that the current situation has to change.
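
For shared clusters, one option might be to keep the basehub defaults fairly conservative and raise the requests per hub where the data shows it's needed, along these lines (a sketch with placeholder values; the exact override layering is an assumption):

# Hypothetical per-hub override layered on top of basehub's defaults.
# Replace the placeholder numbers with values taken from the usage dashboards.
jupyterhub:
  hub:
    resources:
      requests:
        cpu: 0.1
        memory: 1Gi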

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Feb 3, 2023
Ref https://2i2c.freshdesk.com/a/tickets/414

We should figure out better defaults in
2i2c-org#2127,
but as LEAP is getting close to publication on some stuff,
this will help us with stabilizing the infrastructure.

consideRatio commented Feb 10, 2023

I'm looking at the openscapes hub pod, and I observe that the hub's response latency can spike.

I wonder if that metric includes all requests, and whether some requests are slow while others are fast. The 50th percentile stays low, but the 99th is often in the seconds.

[Grafana screenshots: hub response latency percentiles]

yuvipanda added a commit to yuvipanda/grafana-dashboards that referenced this issue Feb 13, 2023
It's a long running connection kept open, serving progressbar
responses via
[EventSource](https://developer.mozilla.org/en-US/docs/Web/API/EventSource).
So it can't be treated as a regular HTTP request / response.

Getting rid of this unmasks more real problems in hub response
latency by removing this noise.

Ref 2i2c-org/infrastructure#2127 (comment)
@yuvipanda
Member

@consideRatio good catch, I opened jupyterhub/grafana-dashboards#59 as a 'fix' on grafana

yuvipanda added a commit to yuvipanda/grafana-dashboards that referenced this issue Jul 25, 2024
It's a long running connection kept open, serving progressbar
responses via
[EventSource](https://developer.mozilla.org/en-US/docs/Web/API/EventSource).
So it can't be treated as a regular HTTP request / response.

Getting rid of this unmasks more real problems in hub response
latency by removing this noise.

Ref 2i2c-org/infrastructure#2127 (comment)