
utils changes needed to add train api #1954

Merged: 9 commits merged into kubeflow:master on Dec 8, 2023

Conversation

@deepanker13 (Contributor) commented Nov 29, 2023

What this PR does / why we need it:
Changes needed for #1945
Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #

  1. The new train API depends on one new function (utils.get_container_spec) and two existing functions with some changes (utils.get_pod_template_spec and utils.get_pytorchjob_template).

  2. A sample of the train API is shown in the first comment below.

  3. The train API will run in elastic mode; for elastic mode it is currently assumed that there is one master replica.

Checklist:

  • Docs included if any changes are user facing

@deepanker13 (Contributor, Author) commented Nov 29, 2023

The train API will look like this:

def train(
        self,
        name=None,
        namespace=None,
        workers=1,
        model_args=None,
        dataset_args=None,
        parameters=None,
        resources_per_worker={"gpu": 0, "cpu": 0, "memory": "10Gi"},
    ):
        """
        Higher-level train API.
        """
        if not name or not namespace:
            raise ValueError("job name or namespace cannot be null")

        # create init container spec
        init_container_spec = utils.get_container_spec(
            name=constants.JOB_PARAMETERS[constants.PYTORCHJOB_KIND]["init_container"],
            image="storage image",
            args=[model_args, dataset_args],
            volume_mounts=models.V1VolumeMount(),
        )

        # create app container spec
        container_spec = utils.get_container_spec(
            name=constants.JOB_PARAMETERS[constants.PYTORCHJOB_KIND]["container"],
            image="app image",
            args=[parameters],
            volume_mounts=models.V1VolumeMount(),
            resources=resources_per_worker,
        )

        # create worker pod spec
        worker_pod_template_spec = utils.get_pod_template_spec(
            job_kind=constants.PYTORCHJOB_KIND,
            containers_spec=[container_spec],
            volumes_spec=[models.V1Volume()],
        )

        # create master pod spec
        master_pod_template_spec = utils.get_pod_template_spec(
            job_kind=constants.PYTORCHJOB_KIND,
            containers_spec=[init_container_spec, container_spec],
            volumes_spec=[models.V1Volume()],
        )

        job = utils.get_pytorchjob_template(
            name=name,
            namespace=namespace,
            master_pod_template_spec=master_pod_template_spec,
            worker_pod_template_spec=worker_pod_template_spec,
            num_worker_replicas=workers,
            num_procs_per_worker=resources_per_worker["gpu"],
            elastic_policy=models.KubeflowOrgV1ElasticPolicy(rdzv_backend="c10d"),
        )

        self.create_job(job)
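
For illustration, a hypothetical invocation of this train API from the SDK's TrainingClient could look like the sketch below; the job name, namespace, and resource values are placeholders and not part of this PR.

from kubeflow.training import TrainingClient

client = TrainingClient()
client.train(
    name="train-demo",        # hypothetical job name
    namespace="kubeflow",     # hypothetical namespace
    workers=2,                # worker replicas created in addition to the master replica
    model_args=None,          # model storage arguments consumed by the init container
    dataset_args=None,        # dataset storage arguments consumed by the init container
    parameters=None,          # training parameters passed to the app container
    resources_per_worker={"gpu": 1, "cpu": 4, "memory": "16Gi"},
)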

@coveralls commented Nov 29, 2023

Pull Request Test Coverage Report for Build 7128056562

Warning: This coverage report may be inaccurate.

We've detected an issue with your CI configuration that might affect the accuracy of this pull request's coverage report.
To ensure accuracy in future PRs, please see these guidelines.
A quick fix for this PR: rebase it; your next report should be accurate.

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.01%) to 42.872%

Totals Coverage Status
Change from base Build 7051444445: 0.01%
Covered Lines: 3753
Relevant Lines: 8754

💛 - Coveralls

@deepanker13 (Contributor, Author) commented:

@johnugeorge I have made the suggested changes.

Resolved review threads on sdk/python/kubeflow/training/utils/utils.py (outdated)

  # If number of Worker replicas is 1, PyTorchJob uses only Master replica.
- if num_worker_replicas != 1:
+ if num_worker_replicas >= 1:
@andreyvelich (Member) commented Dec 6, 2023

I think you should change it to this:

Suggested change:
- if num_worker_replicas >= 1:
+ if num_worker_replicas > 1:

Otherwise, how can we use the create_job API to create a PyTorchJob with 1 worker?

@deepanker13 (Contributor, Author) replied:

If we set master_pod_template_spec to None and num_worker_replicas == 1, then we can create a PyTorchJob with 1 worker.
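
A minimal sketch of the replica selection being discussed, assuming get_pytorchjob_template receives an optional master pod template spec; the helper name and dict structure below are illustrative only, not the actual utils code.

def build_replica_specs(master_spec, worker_spec, num_worker_replicas):
    # Illustrative only: shows when Master and Worker replica specs would be added.
    replica_specs = {}
    if master_spec is not None:
        # Per this PR, elastic mode assumes a single Master replica.
        replica_specs["Master"] = {"replicas": 1, "template": master_spec}
        # With a Master present, Workers are added only when more than one is requested.
        if num_worker_replicas > 1:
            replica_specs["Worker"] = {"replicas": num_worker_replicas, "template": worker_spec}
    elif num_worker_replicas >= 1:
        # Without a Master spec, a single-worker job runs with Worker replicas only.
        replica_specs["Worker"] = {"replicas": num_worker_replicas, "template": worker_spec}
    return replica_specs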

Member replied:

@johnugeorge @deepanker13 @tenzen-y Is it correct to start a PyTorchJob with one Worker when the user wants to run single-worker training? I was under the impression that if a PyTorchJob has 1 replica, we should set it under the Master spec.

I just started a PyTorchJob with a single Worker, and the labels on the pods are the following:

training.kubeflow.org/job-name=pytorch-simple
training.kubeflow.org/operator-name=pytorchjob-controller
training.kubeflow.org/replica-index=0
training.kubeflow.org/replica-type=worker

Member replied:

It should work. Did you see any issues with it?

@andreyvelich (Member) commented Dec 7, 2023

@johnugeorge, are these labels correct for single-worker training?
E.g. for TFJob, I can see that we attach the training.kubeflow.org/job-role=master label in that case:

training.kubeflow.org/job-name=tfjob-test
training.kubeflow.org/job-role=master
training.kubeflow.org/operator-name=tfjob-controller
training.kubeflow.org/replica-index=0
training.kubeflow.org/replica-type=worker

I think it is due to this: https://github.com/kubeflow/training-operator/blob/master/pkg/controller.v1/tensorflow/tfjob_controller.go#L611-L613

Should we have similar changes for all types of jobs, e.g. PyTorchJob, XGBoostJob, PaddleJob?

@deepanker13 (Contributor, Author) commented Dec 8, 2023

@andreyvelich Yes, for now I am going to correct it in the PyTorchJob.

@johnugeorge (Member) commented:

/lgtm

@andreyvelich (Member) left a comment:

Thanks @deepanker13!
/lgtm
/approve


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, deepanker13

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow bot merged commit ca9e7e3 into kubeflow:master on Dec 8, 2023
33 checks passed
4 participants