Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AUTO-BACKPORT 10138] feat: determined_master_host and friends helm support, better defaults #10159

Merged
merged 1 commit into from
Oct 31, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions .circleci/devcluster/multi-k8s.devcluster.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -39,7 +39,7 @@ stages:
slot_resource_requests:
cpu: 1
kubeconfig_path: /tmp/defaultrm-kubeconf
determined_master_ip: $DOCKER_LOCALHOST
determined_master_host: $DOCKER_LOCALHOST
determined_master_port: 8080
internal_task_gateway:
gateway_name: contour
Expand All @@ -60,7 +60,7 @@ stages:
slot_resource_requests:
cpu: 1
kubeconfig_path: /tmp/additionalrm-kubeconf
determined_master_ip: $DOCKER_LOCALHOST
determined_master_host: $DOCKER_LOCALHOST
determined_master_port: 8080
resource_pools:
- pool_name: additional_pool
2 changes: 1 addition & 1 deletion .circleci/devcluster/single-k8s.devcluster.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -33,5 +33,5 @@ stages:
slot_resource_requests:
cpu: 1
kubeconfig_path: ~/.kube/config
determined_master_ip: $DOCKER_LOCALHOST
determined_master_host: $DOCKER_LOCALHOST
determined_master_port: 8080
9 changes: 9 additions & 0 deletions docs/release-notes/add-host-port-scheme-to-helm.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
:orphan:

**New Features**

- Helm: Support configuring ``determined_master_host``, ``determined_master_port``, and
``determined_master_scheme``. These control how tasks address the Determined API server and are
useful when installations span multiple Kubernetes clusters or there are proxies in between tasks
and the master. Also, ``determined_master_host`` now defaults to the service host,
``<det_namespace>.<det_service_name>.svc.cluster.local``, instead of the service IP.
Original file line number Diff line number Diff line change
Expand Up @@ -156,7 +156,7 @@ the same as the “cluster name” for a given cluster.

If an additional resource manager needs to connect to the Determined master through a gateway
requiring TLS, ``resource_manager.determined_master_scheme`` should be set to ``https``. If
``resource_manager.determined_master_scheme`` is not set ``determined_master_ip`` will assume
``resource_manager.determined_master_scheme`` is not set ``determined_master_host`` will assume
``https`` if the master is terminating TLS and ``http`` otherwise.

*******
Expand Down
15 changes: 15 additions & 0 deletions helm/charts/determined/templates/master-config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -224,6 +224,21 @@ stringData:
fluent:
{{- toYaml .Values.fluent | nindent 8}}
{{- end }}
{{- if .Values.determinedMasterHost }}
{{- if .Values.determinedMasterScheme }}
determined_master_scheme: {{ .Values.determinedMasterScheme | quote }}
{{- else if .Values.tlsSecret }}
determined_master_scheme: "https"
{{- else }}
determined_master_scheme: "http"
{{- end }}
determined_master_host: {{ .Values.determinedMasterHost | quote }}
{{- if .Values.determinedMasterPort }}
determined_master_port: {{ .Values.determinedMasterPort }}
{{- else }}
determined_master_port: {{ .Values.masterPort }}
{{- end }}
{{- end }}

default_aux_resource_pool: {{.Values.defaultAuxResourcePool}}
default_compute_resource_pool: {{.Values.defaultComputeResourcePool}}
Expand Down
12 changes: 11 additions & 1 deletion helm/charts/determined/values.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -413,6 +413,16 @@ resourcePools:
## Configure the initial user password for the cluster
# initialUserPassword:

# determinedMasterHost configures the hostname that tasks launched by the primary resource manager use when
# communicating with our API server. This is useful when installations span multiple Kubernetes clusters and when there
# are proxies in between tasks and the master. It defaults to `<det_namespace>.<det_service_name>.svc.cluster.local`.
# determinedMasterHost:
# determinedMasterPort configures the port for the host above. It defaults to `masterPort` specified elsewhere. Must
# be used with determinedMasterHost or it is ineffective.
# determinedMasterPort:
# determinedMasterScheme configures the scheme for the host and port above. It defaults to `https` if our API server
# is deployed with TLS, else `http`. Must be used with determinedMasterHost or it is ineffective.
# determinedMasterScheme:
# additional_resource_managers:
# - resource_manager:
# type: kubernetes
Expand All @@ -421,7 +431,7 @@ resourcePools:
# default_namespace: default
# kubeconfig_secret_name: additionalrm
# kubeconfig_secret_value: config
# determined_master_ip: 10.11.12.13
# determined_master_host: 10.11.12.13
# determined_master_port: 8080
# resource_pools:
# - pool_name: additional_pool
Expand Down
103 changes: 73 additions & 30 deletions master/cmd/determined-master/root.go
Original file line number Diff line number Diff line change
Expand Up @@ -185,23 +185,20 @@ func getConfig(configMap map[string]interface{}) (*config.Config, error) {
}

func applyBackwardsCompatibility(configMap map[string]interface{}) (map[string]interface{}, error) {
// Preemption timeout moved from __internal to task_container_defaults
if internalMap, ok := configMap["__internal"].(map[string]interface{}); ok {
if oldPreemptTimeout, ok := internalMap["preemption_timeout"]; ok && oldPreemptTimeout != nil {
if preemptionDuration, err := time.ParseDuration(oldPreemptTimeout.(string)); err == nil {
preemptionTimeoutSeconds := int(preemptionDuration.Seconds())
if taskContainerMap, ok := configMap["task_container_defaults"].(map[string]interface{}); ok {
// Only set task_container_defaults from __internal if nil
if taskContainerTimeout, ok := taskContainerMap["preemption_timeout"]; !ok || taskContainerTimeout == nil {
taskContainerMap["preemption_timeout"] = preemptionTimeoutSeconds
configMap["task_container_defaults"] = taskContainerMap
}
}
}
}
delete(internalMap, "preemption_timeout")
err := shimOldRMConfig(configMap)
if err != nil {
return nil, err
}

shimPreemptionTimeout(configMap)
shimDeterminedMasterIP(configMap)
return configMap, nil
}

// Ensure we use either the old schema or the new one.
// Use configMap if RMs are not defined at all, or if they are defined using the new schema.
// If use the old schema, convert it to the new one.
func shimOldRMConfig(configMap map[string]interface{}) error {
const (
defaultVal = "default"
agentVal = "agent"
Expand All @@ -213,20 +210,17 @@ func applyBackwardsCompatibility(configMap map[string]interface{}) (map[string]i
vScheduler, schedulerExisted := configMap["scheduler"]
vProvisioner, provisionerExisted := configMap["provisioner"]

// Ensure we use either the old schema or the new one.
oldRMConfig := schedulerExisted || provisionerExisted
newRMConfig := rmExisted || rpsExisted
if newRMConfig && oldRMConfig {
return nil, errors.New(
return errors.New(
"cannot use the old and the new configuration schema at the same time",
)
}
if !oldRMConfig {
// Use configMap if RMs are not defined at all, or if they are defined using the new schema.
return configMap, nil
return nil
}

// If use the old schema, convert it to the new one.
newScheduler := map[string]interface{}{
"type": "priority",
"fitting_policy": "best",
Expand All @@ -237,38 +231,38 @@ func applyBackwardsCompatibility(configMap map[string]interface{}) (map[string]i
if schedulerExisted {
schedulerMap, ok := vScheduler.(map[string]interface{})
if !ok {
return nil, errors.New("wrong type for scheduler field")
return errors.New("wrong type for scheduler field")
}

if vFit, ok := schedulerMap["fit"]; ok {
newScheduler["fitting_policy"], ok = vFit.(string)
if !ok {
return nil, errors.New("wrong type for scheduler.fit field")
return errors.New("wrong type for scheduler.fit field")
}
}
if vType, ok := schedulerMap["type"]; ok {
newScheduler["type"], ok = vType.(string)
if !ok {
return nil, errors.New("wrong type for scheduler.type field")
return errors.New("wrong type for scheduler.type field")
}
}
if vRP, ok := schedulerMap["resource_provider"]; ok {
rpMap, ok := vRP.(map[string]interface{})
if !ok {
return nil, errors.New("wrong type for scheduler.resource_provider field")
return errors.New("wrong type for scheduler.resource_provider field")
}

vRPType, ok := rpMap["type"]
if ok {
switch vRPTypeStr, ok := vRPType.(string); {
case !ok:
return nil, errors.New("wrong type for scheduler.resource_provider.type field")
return errors.New("wrong type for scheduler.resource_provider.type field")
case vRPTypeStr == defaultVal:
newRM["type"] = agentVal
case vRPTypeStr == kubernetesVal:
newRM["type"] = kubernetesVal
default:
return nil, errors.New("wrong value for scheduler.resource_provider.type field")
return errors.New("wrong value for scheduler.resource_provider.type field")
}
} else {
newRM["type"] = agentVal
Expand Down Expand Up @@ -298,18 +292,18 @@ func applyBackwardsCompatibility(configMap map[string]interface{}) (map[string]i
if provisionerExisted {
provisionerMap, ok := vProvisioner.(map[string]interface{})
if !ok {
return nil, errors.New("wrong type for provisioner field")
return errors.New("wrong type for provisioner field")
}

newRP["provider"] = provisionerMap
if vProvider, ok := provisionerMap["provider"]; ok {
vProviderStr, ok := vProvider.(string)
if !ok {
return nil, errors.New("wrong type for provisioner.provider field")
return errors.New("wrong type for provisioner.provider field")
}

if vProviderStr != "aws" && vProviderStr != "gcp" {
return nil, errors.New("wrong value for provisioner.provider field")
return errors.New("wrong value for provisioner.provider field")
}

provisionerMap["type"] = provisionerMap["provider"]
Expand All @@ -321,6 +315,55 @@ func applyBackwardsCompatibility(configMap map[string]interface{}) (map[string]i

delete(configMap, "scheduler")
delete(configMap, "provisioner")
return nil
}

return configMap, nil
// Preemption timeout moved from __internal to task_container_defaults
// Only set task_container_defaults from __internal if nil.
func shimPreemptionTimeout(configMap map[string]interface{}) {
if internalMap, ok := configMap["__internal"].(map[string]interface{}); ok {
if oldPreemptTimeout, ok := internalMap["preemption_timeout"]; ok && oldPreemptTimeout != nil {
if preemptionDuration, err := time.ParseDuration(oldPreemptTimeout.(string)); err == nil {
preemptionTimeoutSeconds := int(preemptionDuration.Seconds())
if taskContainerMap, ok := configMap["task_container_defaults"].(map[string]interface{}); ok {
if taskContainerTimeout, ok := taskContainerMap["preemption_timeout"]; !ok || taskContainerTimeout == nil {
taskContainerMap["preemption_timeout"] = preemptionTimeoutSeconds
configMap["task_container_defaults"] = taskContainerMap
}
}
}
}
delete(internalMap, "preemption_timeout")
}
}

// Rename from `determined_master_ip` to `determined_master_host` in old KubernetesRM configs.
func shimDeterminedMasterIP(configMap map[string]interface{}) {
shimDeterminedMasterIPInRM := func(rm map[string]any) {
if rm["type"] != "kubernetes" {
return
}
if ip, ok := rm["determined_master_ip"]; ok {
if _, ok := rm["determined_master_host"]; !ok {
rm["determined_master_host"] = ip
delete(rm, "determined_master_ip")
} else {
log.Warn("ignoring duplicated configuration `determined_master_ip`")
delete(rm, "determined_master_ip")
}
}
}
if rmi, ok := configMap["resource_manager"]; ok {
if rm, ok := rmi.(map[string]any); ok {
shimDeterminedMasterIPInRM(rm)
}
}
if rmis, ok := configMap["additional_resource_managers"]; ok {
rms, ok := rmis.([]map[string]any)
if ok {
for _, rm := range rms {
shimDeterminedMasterIPInRM(rm)
}
}
}
}
83 changes: 83 additions & 0 deletions master/cmd/determined-master/root_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -228,6 +228,89 @@ func TestApplyBackwardsCompatibility(t *testing.T) {
},
},
},
{
name: "determined master ip to host rename, ip set",
before: map[string]interface{}{
"resource_manager": map[string]interface{}{
"type": "kubernetes",
"determined_master_ip": "10.0.0.1",
},
"additional_resource_managers": []map[string]any{
{
"type": "kubernetes",
"determined_master_ip": "10.0.0.2",
},
},
},
expected: map[string]interface{}{
"resource_manager": map[string]interface{}{
"type": "kubernetes",
"determined_master_host": "10.0.0.1",
},
"additional_resource_managers": []map[string]any{
{
"type": "kubernetes",
"determined_master_host": "10.0.0.2",
},
},
},
},
{
name: "determined master ip to host rename, both set",
before: map[string]interface{}{
"resource_manager": map[string]interface{}{
"type": "kubernetes",
"determined_master_host": "10.0.0.1",
"determined_master_ip": "10.0.0.1",
},
"additional_resource_managers": []map[string]any{
{
"type": "kubernetes",
"determined_master_host": "10.0.0.2",
"determined_master_ip": "10.0.0.2",
},
},
},
expected: map[string]interface{}{
"resource_manager": map[string]interface{}{
"type": "kubernetes",
"determined_master_host": "10.0.0.1",
},
"additional_resource_managers": []map[string]any{
{
"type": "kubernetes",
"determined_master_host": "10.0.0.2",
},
},
},
},
{
name: "determined master ip to host rename, host set get left alone",
before: map[string]interface{}{
"resource_manager": map[string]interface{}{
"type": "kubernetes",
"determined_master_host": "10.0.0.1",
},
"additional_resource_managers": []map[string]any{
{
"type": "kubernetes",
"determined_master_host": "10.0.0.2",
},
},
},
expected: map[string]interface{}{
"resource_manager": map[string]interface{}{
"type": "kubernetes",
"determined_master_host": "10.0.0.1",
},
"additional_resource_managers": []map[string]any{
{
"type": "kubernetes",
"determined_master_host": "10.0.0.2",
},
},
},
},
}
for ix := range tcs {
tc := tcs[ix]
Expand Down
Loading
Loading