
utils changes needed to add train api #1954

Merged: 9 commits merged into kubeflow:master on Dec 8, 2023

Conversation

@deepanker13 (Contributor) commented Nov 29, 2023

What this PR does / why we need it:
Changes needed for #1945
Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #

  1. The new train API depends on one new function (utils.get_container_spec) and two existing functions with some changes (utils.get_pod_template_spec and utils.get_pytorchjob_template).

  2. A sample of the train API is shown in the first comment below.

  3. The train API will run in elastic mode; for elastic mode it is currently assumed that there is one master replica.

Checklist:

  • Docs included if any changes are user facing

@deepanker13 (Contributor, Author) commented Nov 29, 2023

The train API will look like this:

def train(
        self,
        name=None,
        namespace=None,
        workers=1,
        model_args=None,
        dataset_args=None,
        parameters=None,
        resources_per_worker={"gpu": 0, "cpu": 0, "memory": "10Gi"},
    ):
        """
        Higher-level train API.
        """
        if not name or not namespace:
            raise ValueError("job name or namespace cannot be null")

        # create init container spec
        init_container_spec = utils.get_container_spec(
            name=constants.JOB_PARAMETERS[constants.PYTORCHJOB_KIND]["init_container"],
            image="storage image",
            args=[model_args, dataset_args],
            volume_mounts=models.V1VolumeMount(),
        )

        # create app container spec
        container_spec = utils.get_container_spec(
            name=constants.JOB_PARAMETERS[constants.PYTORCHJOB_KIND]["container"],
            image="app image",
            args=[parameters],
            volume_mounts=models.V1VolumeMount(),
            resources=resources_per_worker,
        )

        # create worker pod spec
        worker_pod_template_spec = utils.get_pod_template_spec(
            job_kind=constants.PYTORCHJOB_KIND,
            containers_spec=[container_spec],
            volumes_spec=[models.V1Volume()],
        )

        # create master pod spec
        master_pod_template_spec = utils.get_pod_template_spec(
            job_kind=constants.PYTORCHJOB_KIND,
            containers_spec=[init_container_spec, container_spec],
            volumes_spec=[models.V1Volume()],
        )

        job = utils.get_pytorchjob_template(
            name=name,
            namespace=namespace,
            master_pod_template_spec=master_pod_template_spec,
            worker_pod_template_spec=worker_pod_template_spec,
            num_worker_replicas=workers,
            num_procs_per_worker=resources_per_worker["gpu"],
            elastic_policy=models.KubeflowOrgV1ElasticPolicy(rdzv_backend="c10d"),
        )

        self.create_job(job)
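
For illustration, a hypothetical invocation of this train API from the SDK's TrainingClient could look like the sketch below; the job name, namespace, and resource values are placeholders and not part of this PR.

from kubeflow.training import TrainingClient

client = TrainingClient()
client.train(
    name="train-demo",        # hypothetical job name
    namespace="kubeflow",     # hypothetical namespace
    workers=2,                # worker replicas created in addition to the master replica
    model_args=None,          # model storage arguments consumed by the init container
    dataset_args=None,        # dataset storage arguments consumed by the init container
    parameters=None,          # training parameters passed to the app container
    resources_per_worker={"gpu": 1, "cpu": 4, "memory": "16Gi"},
)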

@coveralls commented Nov 29, 2023

Pull Request Test Coverage Report for Build 7128056562

Warning: This coverage report may be inaccurate.

We've detected an issue with your CI configuration that might affect the accuracy of this pull request's coverage report.
To ensure accuracy in future PRs, please see these guidelines.
A quick fix for this PR: rebase it; your next report should be accurate.

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.01%) to 42.872%

Totals Coverage Status
Change from base Build 7051444445: 0.01%
Covered Lines: 3753
Relevant Lines: 8754

💛 - Coveralls

@deepanker13 (Contributor, Author) commented:

@johnugeorge I have made the suggested changes.

Resolved review threads on sdk/python/kubeflow/training/utils/utils.py (outdated)

  # If number of Worker replicas is 1, PyTorchJob uses only Master replica.
- if num_worker_replicas != 1:
+ if num_worker_replicas >= 1:
@andreyvelich (Member) commented Dec 6, 2023

I think you should change it to this:

Suggested change:
- if num_worker_replicas >= 1:
+ if num_worker_replicas > 1:

Otherwise, how can we use the create_job API to create a PyTorchJob with 1 worker?

@deepanker13 (Contributor, Author) replied:

If we set master_pod_template_spec to None and num_worker_replicas == 1, then we can create a PyTorchJob with 1 worker.
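
A minimal sketch of the replica selection being discussed, assuming get_pytorchjob_template receives an optional master pod template spec; the helper name and dict structure below are illustrative only, not the actual utils code.

def build_replica_specs(master_spec, worker_spec, num_worker_replicas):
    # Illustrative only: shows when Master and Worker replica specs would be added.
    replica_specs = {}
    if master_spec is not None:
        # Per this PR, elastic mode assumes a single Master replica.
        replica_specs["Master"] = {"replicas": 1, "template": master_spec}
        # With a Master present, Workers are added only when more than one is requested.
        if num_worker_replicas > 1:
            replica_specs["Worker"] = {"replicas": num_worker_replicas, "template": worker_spec}
    elif num_worker_replicas >= 1:
        # Without a Master spec, a single-worker job runs with Worker replicas only.
        replica_specs["Worker"] = {"replicas": num_worker_replicas, "template": worker_spec}
    return replica_specs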

Member replied:

@johnugeorge @deepanker13 @tenzen-y Is it correct to start a PyTorchJob with one Worker when the user wants to run single-worker training? I was under the impression that if a PyTorchJob has 1 replica, we should set it under the Master spec.

I just started a PyTorchJob with a single Worker, and the labels on the pods are the following:

training.kubeflow.org/job-name=pytorch-simple
training.kubeflow.org/operator-name=pytorchjob-controller
training.kubeflow.org/replica-index=0
training.kubeflow.org/replica-type=worker

Member replied:

It should work. Did you see any issues with it?

@andreyvelich (Member) commented Dec 7, 2023

@johnugeorge, are these labels correct for single-worker training?
E.g. for TFJob, I can see that we attach the training.kubeflow.org/job-role=master label in that case:

training.kubeflow.org/job-name=tfjob-test
training.kubeflow.org/job-role=master
training.kubeflow.org/operator-name=tfjob-controller
training.kubeflow.org/replica-index=0
training.kubeflow.org/replica-type=worker

I think it is due to this: https://github.com/kubeflow/training-operator/blob/master/pkg/controller.v1/tensorflow/tfjob_controller.go#L611-L613

Should we have similar changes for all types of jobs, e.g. PyTorchJob, XGBoostJob, PaddleJob?

@deepanker13 (Contributor, Author) commented Dec 8, 2023

@andreyvelich Yes, for now I am going to correct it in the PyTorchJob.

@johnugeorge (Member) commented:

/lgtm

@andreyvelich (Member) left a comment:

Thanks @deepanker13!
/lgtm
/approve


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, deepanker13

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow bot merged commit ca9e7e3 into kubeflow:master on Dec 8, 2023
33 checks passed
4 participants