feat: update helm chart for k8 RP [DET-3542] #882
Conversation
Just a few questions
{{- end }}
{{ end }}

{{- if .Values.telemetry }}
non-blocking: is there no way to do an `and` comparison in one if statement? It's not a big deal, but it might look cleaner.
doesn't seem like Helm has an `and` flow control operator, unfortunately
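(Editor's note for future readers: Go's text/template engine, which Helm builds on, does ship `and`, `or`, and `not` as built-in functions, so two conditions can be combined in a single `if`. A sketch; the `.Values.telemetry.enabled` key here is illustrative, not necessarily part of this chart:)

```
{{- if and .Values.telemetry .Values.telemetry.enabled }}
telemetry:
  enabled: true
{{- end }}
```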
# scheduling multi-GPU tasks. Multi-GPU (distributed training) tasks will be scheduled as
# slotsPerTask / slotsPerNode separate pods (task sizes that are not divisible by slotsPerNode
# are never scheduled).
slotsPerNode:
question: what happens if the cluster has different instance types? Is there another way we are making sure each node has the same number of GPUs?
Right now we don't do anything to account for it. One thing users could do is set it to the smallest common denominator; then we will just have the overhead of running multiple pods per node. I am planning to document this as part of M5, and will also expand the doc-string here with this info.
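The pod-count rule described in the doc-string above can be sketched as follows. This is an illustrative model, not Determined's actual scheduler code: a task asking for `slotsPerTask` GPUs is split into `slotsPerTask / slotsPerNode` pods, and task sizes that do not divide evenly are never scheduled.

```python
from typing import Optional


def pods_for_task(slots_per_task: int, slots_per_node: int) -> Optional[int]:
    """Return the number of pods a task would be split into, or None
    if the task size is not divisible by slotsPerNode (unschedulable)."""
    if slots_per_task % slots_per_node != 0:
        return None  # e.g. a 6-slot task on 4-GPU nodes is never scheduled
    return slots_per_task // slots_per_node


# With slotsPerNode set to the smallest GPU count in a mixed cluster,
# larger nodes simply run multiple pods per node (the overhead mentioned above).
print(pods_for_task(8, 4))  # 2 pods
print(pods_for_task(6, 4))  # None: not divisible
```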
helm/charts/determined/values.yaml
masterCpuRequest: "4"
masterMemRequest: "8Gi"

## Configure the task container defaults. Tasks include trials, commands, tensorboards and more.
non-blocking: "more" is just notebooks, right?
There are also shells; I'll just write them all out.
## random non-privileged ports, respectively.
taskContainerDefaults:
  shmSizeBytes: 4294967296
  # networkMode: bridge
question: is bridge the only valid value for networkMode?
There is also "host" networking mode, but obviously that isn't advisable in k8s.
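(A sketch of the two values mentioned in this thread, in context; the inline comments are illustrative annotations, not chart documentation. Note that 4294967296 bytes is 4 GiB:)

```
taskContainerDefaults:
  shmSizeBytes: 4294967296  # 4 * 1024^3 bytes = 4 GiB of shared memory
  networkMode: bridge       # "host" is also accepted, but not advisable in k8s
```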
Description
Added several missing configurations, made the k8 RP configurable, and added CPU and memory requests for the master.
Test Plan
Tested manually by deploying on a k8 cluster. Automated testing will be added as part of M4.