Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(pytorch): Add elastic proposal #522

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

gaocegege
Copy link
Member

Signed-off-by: cegao [email protected]

@google-oss-robot
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gaocegege
To complete the pull request process, please assign james-jwu after the PR has been reviewed.
You can assign the PR to them by writing /assign @james-jwu in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: cegao <[email protected]>
Signed-off-by: cegao <[email protected]>
@gaocegege gaocegege changed the title feat(pytorch): Add elastic proposal WIP feat(pytorch): Add elastic proposal Aug 18, 2021
@gaocegege
Copy link
Member Author

/cc @kubeflow/wg-training-leads @alculquicondor @zw0610

proposals/pytorch-elastic-proposal.md Outdated Show resolved Hide resolved
}
```

Two fields are added in `common.ReplicaSpec`: `minReplicas` and `maxReplicas`. They acts as MIN_SIZE and MAX_SIZE in the elastic example above.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these fields make sense for every other operator?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I think so. But we should discuss it further /cc @kubeflow/wg-training-leads

proposals/pytorch-elastic-proposal.md Show resolved Hide resolved
proposals/pytorch-elastic-proposal.md Outdated Show resolved Hide resolved

### Environment Variables

`SetPodEnv` in `pkg/controller.v1/pytorch/pytorch.go` should be changed. There is no need to set `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT` if TorchElastic is used. `KUBEFLOW_RDZV_HOST`, `KUBEFLOW_RDZV_PORT`, `KUBEFLOW_MIN_SIZE` and `KUBEFLOW_MAX_SIZE` Should be set instead.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what determines if a PyTorchJob is elastic or not? whether minReplicas and maxReplicas are different than nil?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is selection of elastic vs non elastic execution for the operator?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Option 1. introduce new field to indicate if it's elastic job.
Option 2. Compare min & max to tell controller implicitly.

I think operator side, it will just reconcile deltas, either add or remove which is already part of the logic. But we do need some changes to honor new fields

https://github.com/kubeflow/common/blob/2f3f636f16ef4cedb12543a96ac1412da98bbca5/pkg/reconciler.v1/common/pod.go#L139-L144

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think so. Prefer the latter.

proposals/pytorch-elastic-proposal.md Outdated Show resolved Hide resolved

### Reconciliation

`JobController.ReconcilePods` should be refactored. Now the pods are returned by `GetPodSlices`. For example, if `spec.Replicas` is 3, the PodSlices may look like: `[[0],[1],[2]]`. It is not expected when elastic training is enabled.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should GetPodSlices return for elastic jobs instead?

How does the controller decide how many pods to create?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alculquicondor I assume controller just provides the elastic ability. A different control loop should make the decision like an autoscaler?\

Copy link
Member Author

@gaocegege gaocegege Aug 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will illustrate more about it in the proposal.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it happens in a different control loop, then the pytorch controller will create the number of pods equal to replicas?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if it happens in a separate control loop, why do we have to make minReplicas and maxReplicas part of the ReplicaSpec?

They could just be part of the HPA object.

value: "${pytorchjob.spec.replicas[worker].minReplicas}"
- name: KUBEFLOW_MAX_SIZE
value: "${pytorchjob.spec.replicas[worker].macReplicas}"
command: "python -m torch.distributed.run --rdzv_backend=c10d --rdzv_endpoint=$KUBEFLOW_RDZV_HOST:$KUBEFLOW_RDZV_PORT --nnodes=$KUBEFLOW_MIN_SIZE:$KUBEFLOW_MAX_SIZE --nproc_per_node=1 xxx.py"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could the operator set a default command and then users can use the args to append more arguments and the python file?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe users want to set their own entrypoint in the command, I think. Thus it may be better to keep command here, WDYT

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PyTorch 1.10 introduced torchrun, so you may need to be flexible to accomidate <1.10 versions that use python -m torch.distributed.run and >=1.10 with torchrun.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, now we use the built-in environment variables PET_* to do it. Then I think we do not have the problem.

}
```

### Autoscaler Integration

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this part of the "alternatives considered"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to support it, but I am not sure how it affects API design. As you know, there is no built-in resource in Kubernetes which has minReplicas and maxReplicas except Autoscaler.

I was thinking if we should put them in the PyTorchJob CRD. Thus make it in the alternatives considered

@Jeffwan
Copy link
Member

Jeffwan commented Aug 19, 2021

Nice proposal. I will leave some comments tomorrow.

proposals/pytorch-elastic-proposal.md Outdated Show resolved Hide resolved

### Environment Variables

`SetPodEnv` in `pkg/controller.v1/pytorch/pytorch.go` should be changed. There is no need to set `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT` if TorchElastic is used. `KUBEFLOW_RDZV_HOST`, `KUBEFLOW_RDZV_PORT`, `KUBEFLOW_MIN_SIZE` and `KUBEFLOW_MAX_SIZE` Should be set instead.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is selection of elastic vs non elastic execution for the operator?

proposals/pytorch-elastic-proposal.md Outdated Show resolved Hide resolved
proposals/pytorch-elastic-proposal.md Outdated Show resolved Hide resolved
proposals/pytorch-elastic-proposal.md Outdated Show resolved Hide resolved
proposals/pytorch-elastic-proposal.md Outdated Show resolved Hide resolved

## Limatations

- KUBEFLOW_RDZV_PORT will be open for every pod even though workers except worker-0 do not use it.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add and expose this port only for Worker-0 pod ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can, do you mean we deal with it with a custom condition loop?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, for example. If we don't want to have additional ports.

proposals/pytorch-elastic-proposal.md Outdated Show resolved Hide resolved
containers:
- name: pytorch
image: <image>
command: "python -m torch.distributed.run --rdzv_backend=c10d --rdzv_endpoint=$KUBEFLOW_RDZV_HOST:$KUBEFLOW_RDZV_PORT --nnodes=$KUBEFLOW_MIN_SIZE:$KUBEFLOW_MAX_SIZE --nproc_per_node=1 xxx.py"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make sure it's user's choice on rendezvous backends? We don't want to manage this part, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. There are three backends now: static, c10d, etcd. And users can also implement their own backend like redis and so on. Users can specify it manually. If they use c10d, we can set rdzv endpoint for them. If they use etcd, they can set the endpoint by themselves.

Copy link
Member

@zw0610 zw0610 Oct 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part may need correction since pytorch-elastic shall be able to read these environment variables directly without being specified as launch arguments.


[TorchElastic operator](https://github.com/pytorch/elastic/blob/master/kubernetes/api/v1alpha1/elasticjob_types.go) implemented by @jeffwan puts the new fields under `PyTorchJobSpec`.

Personally, prefer keeping it in `common.ReplicaSpec` since other Jobs may also need it.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree on this direction

Personally, prefer keeping it in `common.ReplicaSpec` since other Jobs may also need it.

```diff
// PyTorchJobSpec is a desired state description of the PyTorchJob.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

API side. Let's take this into consideration. We want to support user specified pods to scale in as an optional field to give more granular control
kubeflow/mpi-operator#410


## Abstract

[TorchElastic](https://pytorch.org/docs/1.9.0/distributed.elastic.html), which was open sourced over a year ago in the pytorch/elastic github repository, is a runner and coordinator for PyTorch worker processes. it has been part of PyTorch core since 1.9.0. This proposal is to support such feature with the help of PyTorchJob.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's involved to support additional frameworks with elastic capabilities?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now we do not have a unified CRD to support all frameworks. Thus I think we can support different frameworks in different CRDs. This PR is for PyTorchJob.

WDYT

Copy link
Member

@terrytangyuan terrytangyuan Oct 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. We just need to highlight what features are available for which job kinds in the docs then.

@gaocegege
Copy link
Member Author

�We may also have this problem: pytorch/pytorch#65992

We should consider it in this proposal.

@gaocegege gaocegege marked this pull request as ready for review November 4, 2021 08:28
@gaocegege
Copy link
Member Author

/cc @qiankunli

@google-oss-prow
Copy link

@gaocegege: GitHub didn't allow me to request PR reviews from the following users: qiankunli.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @qiankunli

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Signed-off-by: cegao <[email protected]>
@gaocegege
Copy link
Member Author

I updated the details about HPA. BTW, I think we should upgrade the PyTorchJob APIVersion to v2beta1 since it is a breaking change.

@alculquicondor
Copy link

What is breaking about it? If users don't set elastic configurations, wouldn't it work as before?

@gaocegege
Copy link
Member Author

I double-checked the CRD definition. We can keep compatibility with v1. common.JobStatus is changed but it does not block v1 if we upgrade the kubeflow/common version. Thus we can do it in v1.

type ReplicaStatus struct {
+	// LabelSelector is the selector for the replica.
+	LabelSelector *metav1.LabelSelector `json:"labelSelector,omitempty"`
	// The number of actively running pods.
	Active int32 `json:"active,omitempty"`
	// The number of pods which reached phase Succeeded.
	Succeeded int32 `json:"succeeded,omitempty"`
	// The number of pods which reached phase Failed.
	Failed int32 `json:"failed,omitempty"`
}

@alculquicondor
Copy link

Right, API changes are breaking only if you remove or rename a field. I think the proposed changes look backwards compatible.

@gaocegege
Copy link
Member Author

gaocegege commented Nov 19, 2021

/cc @kubeflow/wg-training-leads

Could you please have another look?

kubeflow/training-operator#1453 The PR is ready, and the coverall coverage increased (+7.1%) to 15.252%, PyTorch related test coverage is increased from 0% to 80%

@Jeffwan
Copy link
Member

Jeffwan commented Nov 29, 2021

The proposal overall looks good to me. A thing I am not clear is the metrics HPA part. I feel it's better to decouple with the job controller but I understand user may need some simple solution for easy onboarding. A global optimizer is something I am looking for. We can discuss it later. Looks like the PR is merged, let's merge this one as well.

/lgtm

@theadactyl
Copy link
Contributor

@kubeflow/wg-training-leads could we get an "/approved" for this?

@terrytangyuan
Copy link
Member

/approve

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gaocegege, terrytangyuan
To complete the pull request process, please assign james-jwu after the PR has been reviewed.
You can assign the PR to them by writing /assign @james-jwu in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@terrytangyuan
Copy link
Member

/assign @james-jwu

@stale
Copy link

stale bot commented Apr 27, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@terrytangyuan
Copy link
Member

/assign @theadactyl

@stale
Copy link

stale bot commented Sep 21, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.