Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support addTask and startTask for k8 RP [DET-3416, DET-3419] #798

Merged
merged 9 commits into from
Jul 13, 2020

Conversation

aaron276h
Copy link
Contributor

Description

This PR supports addTask and startTask for Kubernetes. It includes the initial implementation of the Kubernetes RP actor, the pods actor, and the pod actor. So far these actors take task spec and translate it to a pod spec amd submit the pod spec to k8.

Test Plan

Tested manually by running on a k8 cluster. Adding CI will be part of M4.

Commentary

One thing that I wish we could do better is not using scheduler/agentState to represent the Kubernetes resources in the scheduler. However, refactoring this will be a significant amount of work that we will plan to complete as future tech debt work.

@@ -267,6 +267,39 @@ func (m *Master) rwCoordinatorWebSocket(socket *websocket.Conn, c echo.Context)
return actorRef.AwaitTermination()
}

func (m *Master) initializeResourceProviders(proxyRef *actor.Ref, provisionerSlotsPerInstance int) {
var resourceProvider *actor.Ref
if m.config.Scheduler.ResourceProvider.DefaultRPConfig != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: Be more explicit about selecting which case to use:

switch {
  case ...DefaultRPConfig != nil:
    ...
  case ...KubernetesRPConfig != nil:
    ...
  default:
    panic("...")
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -28,6 +28,8 @@ const (
TrialEntrypointScriptResource = "entrypoint.sh"
// AgentSetupScriptTemplateResource is the template for the script to run a dynamic agent.
AgentSetupScriptTemplateResource = "agent_setup_script.sh.template"
// InitContainerEntryScriptResource is the script to run the init container on k8s.
InitContainerEntryScriptResource = "init_container_entrypoint.sh"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking: Put something about Kubernetes into the name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

)

if !ok {
return errors.Errorf(fmt.Sprintf("pod actor %s already exits", ref.Address().String()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return errors.Errorf(fmt.Sprintf("pod actor %s already exits", ref.Address().String()))
return errors.Errorf("pod actor %s already exists", ref.Address().String())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

return newTaskSummary(a.task)
}

type podAssignment struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking: Probably these (both this and containerAssignment) should just be moved into the respective *_resource_provider.go files. That feels like a more natural grouping to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved them over, although it doesn't seem that great adding this to the default_resource_provider as that has bloated to 600+ lines.

k.receiveStartTask(ctx, msg)

default:
ctx.Log().Error("Unexpected message", reflect.TypeOf(msg))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we don't normally bother logging these, so I'd probably just drop this (and the other instance in the PR). But if you really want to have it here, it should use %T rather than reflect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I find these logs useful, especially during development. I updated to not use reflect.

master/internal/kubernetes/pod.go Show resolved Hide resolved
master/internal/kubernetes/pod.go Show resolved Hide resolved
master/internal/kubernetes/pod.go Show resolved Hide resolved
@aaron276h aaron276h assigned aaron276h and unassigned dzhu Jul 9, 2020
As part of this change added a pods actor that is part of the
k8 RP, and added several additional configurations to the
k8 RP config.
Translate task spec to pod spec and submit to k8.
Copy link
Contributor Author

@aaron276h aaron276h left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dzhu I made the changes as a separate new commit.

@@ -267,6 +267,39 @@ func (m *Master) rwCoordinatorWebSocket(socket *websocket.Conn, c echo.Context)
return actorRef.AwaitTermination()
}

func (m *Master) initializeResourceProviders(proxyRef *actor.Ref, provisionerSlotsPerInstance int) {
var resourceProvider *actor.Ref
if m.config.Scheduler.ResourceProvider.DefaultRPConfig != nil {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

master/internal/kubernetes/pod.go Show resolved Hide resolved
master/internal/kubernetes/pod.go Show resolved Hide resolved
)

if !ok {
return errors.Errorf(fmt.Sprintf("pod actor %s already exits", ref.Address().String()))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

return newTaskSummary(a.task)
}

type podAssignment struct {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved them over, although it doesn't seem that great adding this to the default_resource_provider as that has bloated to 600+ lines.

k.receiveStartTask(ctx, msg)

default:
ctx.Log().Error("Unexpected message", reflect.TypeOf(msg))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I find these logs useful, especially during development. I updated to not use reflect.

@@ -28,6 +28,8 @@ const (
TrialEntrypointScriptResource = "entrypoint.sh"
// AgentSetupScriptTemplateResource is the template for the script to run a dynamic agent.
AgentSetupScriptTemplateResource = "agent_setup_script.sh.template"
// InitContainerEntryScriptResource is the script to run the init container on k8s.
InitContainerEntryScriptResource = "init_container_entrypoint.sh"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

master/internal/kubernetes/pod.go Show resolved Hide resolved
@aaron276h aaron276h assigned dzhu and unassigned aaron276h Jul 9, 2020
Copy link
Contributor

@dzhu dzhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly nitpicks; see kubernetes_resource_provider.go and volumes.go for some that may actually matter.

func (k *kubernetesResourceProvider) receiveAddTask(ctx *actor.Context, msg AddTask) {
actors.NotifyOnStop(ctx, msg.TaskHandler, taskStopped{Ref: msg.TaskHandler})

if task, ok := k.tasksByHandler[ctx.Sender()]; ok {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: Should this be msg.TaskHandler? Isn't the sender always the resource providers actor due to forwarding?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh really good catch. Had the same mistake in the default RP also.


if task.SlotsNeeded()%k.slotsPerNode != 0 {
ctx.Log().WithField("task ID", task.ID).Error(
"task number of slots is not schedulable on the configured slots_per_node")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking: This could bear to be more explicit:

Suggested change
"task number of slots is not schedulable on the configured slots_per_node")
"task is not schedulable: number of slots (%d) is not a multiple of slots_per_node (%d)", task.SlotsNeeded(), k.slotsPerNode)

(adjusted for line length)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

},
Spec: v1.PodSpec{
Volumes: volumes,
HostNetwork: p.taskSpec.TaskContainerDefaults.NetworkMode == hostNetwork,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking:

Suggested change
HostNetwork: p.taskSpec.TaskContainerDefaults.NetworkMode == hostNetwork,
HostNetwork: p.taskSpec.TaskContainerDefaults.NetworkMode.IsHost(),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh yeah good idea

"github.com/docker/docker/api/types/mount"

"github.com/determined-ai/determined/master/pkg/container"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking: Put all the internal imports into one block.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

"github.com/determined-ai/determined/master/pkg/model"
"github.com/determined-ai/determined/master/pkg/tasks"

v1 "k8s.io/api/core/v1"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking: corev1 or k8sv1, maybe?

// LocalRendezvousPort is the start of the range of ports used for rendezvous by tasks.
const LocalRendezvousPort = 1734

// LocalRendezvousPortOffset is the offset for rendezvous ports.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking: Just "the offset" isn't super clear. Maybe "...the difference between the two rendezvous ports for each container"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated it

@@ -18,16 +18,17 @@ import (
)

const (
containerWorkDir = "/run/determined/workdir"
// ContainerWorkDir is working directory for containers.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// ContainerWorkDir is working directory for containers.
// ContainerWorkDir is the working directory for tasks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

return user
}

// ConfigureCommandEnvVars configures environment variables for cmd tasks.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking: ConfigureXXX for all of the new functions here feels a bit off, since that sounds like something that actually goes out and tweaks settings somewhere as part of the function. I think just the XXX part usually sounds good (CommandEnvVars, CommandArchives, etc.), or maybe XXXConfiguration if that's too bare.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, updated these

return envVarsMap
}

// ConfigureCommandArchives returns the additional files for a c as an archive.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// ConfigureCommandArchives returns the additional files for a c as an archive.
// ConfigureCommandArchives returns the additional files for a command as an archive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

envVars := defaultEnvVars()
envVars = append(envVars, gcc.ExperimentConfig.Environment.EnvironmentVariables.For(deviceType)...)
mounts := toDockerMounts(gcc.ExperimentConfig.BindMounts)
// ConfigureGCEnvVars returns environment variables for gc.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking:

Suggested change
// ConfigureGCEnvVars returns environment variables for gc.
// ConfigureGCEnvVars returns environment variables for checkpoint GC.

(Also several spots below.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@dzhu dzhu assigned aaron276h and unassigned dzhu Jul 11, 2020
}

func (k *kubernetesResourceProvider) getOrCreateGroup(
handler *actor.Ref,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking: Let's keep the context first everywhere it's passed as an argument.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@aaron276h aaron276h merged commit 9e1664c into determined-ai:master Jul 13, 2020
@dannysauer dannysauer added this to the 0.12.12 milestone Feb 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants