
feat: update helm chart for k8 RP [DET-3542] #882

Merged: 4 commits, Jul 20, 2020

Conversation

aaron276h
Contributor

Description

Added several missing configurations, made the k8 RP configurable, and added CPU and memory requests for the master.

Test Plan

Tested manually by deploying on a k8 cluster. Automated testing will be added as part of M4.

Contributor

@sidneyw sidneyw left a comment


Just a few questions

{{- end }}
{{ end }}

{{- if .Values.telemetry }}
Contributor


non-blocking: is there no way to do an `and` comparison in a single `if` statement? It's not a big deal, but it might look cleaner.

Contributor Author


Unfortunately, Helm doesn't seem to have an `and` flow-control operator.
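For illustration, the two shapes being discussed, side by side. Helm templates are Go text/template, where `and` exists as a function rather than an infix operator; the value names below (`telemetry`, `telemetry.enabled`) are illustrative, not necessarily this chart's schema. Note that at the time, template `and` evaluated all of its arguments, so the nested form also guards against a nil outer value.

```yaml
# Nested form, as in the diff:
{{- if .Values.telemetry }}
{{- if .Values.telemetry.enabled }}
telemetry:
  enabled: true
{{- end }}
{{- end }}

# Single-condition form using the `and` template function:
{{- if and .Values.telemetry .Values.telemetry.enabled }}
telemetry:
  enabled: true
{{- end }}
```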

# scheduling multi-GPU tasks. Multi-GPU (distributed training) tasks will be scheduled as
# slotsPerTask / slotsPerNode separate pods (task sizes that are not divisible by slotsPerNode
# are never scheduled).
slotsPerNode:
Contributor


question: what happens if the cluster has different instance types? Is there another way we are making sure each node has the same number of GPUs?

Contributor Author


Right now we don't do anything to account for it. One thing users could do is set it to the smallest denominator; then we will just have the overhead of running multiple pods per node. I am planning to document this as part of M5, and will also expand the doc-string here with this info.
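To make the scheduling arithmetic concrete, a hedged values.yaml sketch (the numbers are illustrative, not recommendations): with a cluster mixing 4-GPU and 8-GPU nodes, setting `slotsPerNode` to the smallest per-node GPU count keeps every pod schedulable on every node type, at the cost of extra pods.

```yaml
# Illustrative: cluster mixes 4-GPU and 8-GPU nodes; pick the smallest.
slotsPerNode: 4

# A 16-slot distributed training task then runs as
#   slotsPerTask / slotsPerNode = 16 / 4 = 4 pods,
# and an 8-GPU node simply hosts two of those pods.
# A 6-slot task (6 is not divisible by 4) would never be scheduled.
```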

masterCpuRequest: "4"
masterMemRequest: "8Gi"
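A sketch of how such values would typically surface in the master Deployment template. The template structure here is an assumption for illustration, not necessarily the chart's actual layout:

```yaml
# Hypothetical excerpt of the master Deployment template consuming these values.
resources:
  requests:
    cpu: {{ .Values.masterCpuRequest }}      # e.g. "4" cores
    memory: {{ .Values.masterMemRequest }}   # e.g. "8Gi"
```

Kubernetes requests reserve capacity for scheduling; without a matching `limits` block, the master can still burst above these values.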

## Configure the task container defaults. Tasks include trials, commands, tensorboards and more.
Contributor


non-blocking: is "more" just notebooks?

Contributor Author


There are also shells; I'll just write them all out.

## random non-privileged ports, respectively.
taskContainerDefaults:
shmSizeBytes: 4294967296
# networkMode: bridge
Contributor


question: is bridge the only valid value for networkMode?

Contributor Author


There is also "host" networking mode, but obviously that isn't advisable in k8s.
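For illustration, the two Docker network modes mentioned, in values.yaml form. The defaults shown come from the snippet above; the `host` line is illustrative and, as noted, not advisable on k8s:

```yaml
taskContainerDefaults:
  shmSizeBytes: 4294967296   # 4 GiB shared memory for task containers
  networkMode: bridge        # default: isolated bridge network per container
  # networkMode: host        # shares the node's network namespace; avoid on k8s
```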

@sidneyw sidneyw assigned aaron276h and unassigned sidneyw Jul 20, 2020
@aaron276h aaron276h assigned sidneyw and unassigned aaron276h Jul 20, 2020
@sidneyw sidneyw assigned aaron276h and unassigned sidneyw Jul 20, 2020
@aaron276h aaron276h merged commit 56df7d2 into determined-ai:master Jul 20, 2020
eecsliu pushed a commit to eecsliu/determined that referenced this pull request Jun 23, 2023
@dannysauer dannysauer added this to the 0.12.12 milestone Feb 6, 2024
eecsliu pushed a commit to determined-ai/determined-release-testing that referenced this pull request Apr 22, 2024