utils changes needed to add train api #1954
Conversation
The train API will look like this:

```python
def train(
    self,
    name=None,
    namespace=None,
    workers=1,
    model_args=None,
    dataset_args=None,
    parameters=None,
    resources_per_worker={"gpu": 0, "cpu": 0, "memory": "10Gi"},
):
    """
    Higher level train API.
    """
    if not name or not namespace:
        raise ValueError("job name or namespace cannot be null")

    # Create the init container spec.
    init_container_spec = utils.get_container_spec(
        name=constants.JOB_PARAMETERS[constants.PYTORCHJOB_KIND]["init_container"],
        image="storage image",
        args=[model_args, dataset_args],
        volume_mounts=models.V1VolumeMount(),
    )

    # Create the app container spec.
    container_spec = utils.get_container_spec(
        name=constants.JOB_PARAMETERS[constants.PYTORCHJOB_KIND]["container"],
        image="app image",
        args=[parameters],
        volume_mounts=models.V1VolumeMount(),
        resources=resources_per_worker,
    )

    # Create the worker pod template spec.
    worker_pod_template_spec = utils.get_pod_template_spec(
        job_kind=constants.PYTORCHJOB_KIND,
        containers_spec=[container_spec],
        volumes_spec=[models.V1Volume()],
    )

    # Create the master pod template spec.
    master_pod_template_spec = utils.get_pod_template_spec(
        job_kind=constants.PYTORCHJOB_KIND,
        containers_spec=[init_container_spec, container_spec],
        volumes_spec=[models.V1Volume()],
    )

    job = utils.get_pytorchjob_template(
        name=name,
        namespace=namespace,
        master_pod_template_spec=master_pod_template_spec,
        worker_pod_template_spec=worker_pod_template_spec,
        num_worker_replicas=workers,
        num_procs_per_worker=resources_per_worker["gpu"],
        elastic_policy=models.KubeflowOrgV1ElasticPolicy(rdzv_backend="c10d"),
    )

    self.create_job(job)
```
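The required-argument guard at the top of the proposed API can be exercised in isolation. The `train` function below is a stripped-down stand-in for illustration, not the real SDK method; it reproduces only the validation step:

```python
def train(name=None, namespace=None, workers=1):
    # Mirrors the guard at the top of the proposed API: both the job name
    # and the namespace must be provided.
    if not name or not namespace:
        raise ValueError("job name or namespace cannot be null")
    return {"name": name, "namespace": namespace, "workers": workers}

# A valid call passes the guard.
job = train(name="demo", namespace="default", workers=2)

# A missing namespace raises, as in the proposed implementation.
try:
    train(name="demo")
    raised = False
except ValueError as exc:
    raised = True
    message = str(exc)
```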
Pull Request Test Coverage Report for Build 7128056562

Warning: This coverage report may be inaccurate. We've detected an issue with your CI configuration that might affect the accuracy of this pull request's coverage report.

💛 - Coveralls
Force-pushed from b0f15e3 to 183b2c0
@johnugeorge I have made the suggested changes.
Force-pushed from 2e2c25f to 765bc00
```diff
 # If number of Worker replicas is 1, PyTorchJob uses only Master replica.
-if num_worker_replicas != 1:
+if num_worker_replicas >= 1:
```
I think you should change it to this:

```diff
-if num_worker_replicas >= 1:
+if num_worker_replicas > 1:
```
Otherwise, how can we use the create_job API to create a PyTorchJob with 1 worker?
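The three candidate conditions under discussion can be compared side by side. The helper below is hypothetical, written only to make the behaviour at one and two replicas explicit:

```python
def adds_worker_spec(num_worker_replicas, condition):
    # Compare the original condition ("!= 1"), the one in this PR (">= 1"),
    # and the suggested replacement ("> 1") for deciding whether a Worker
    # replica spec is emitted.
    if condition == "!= 1":
        return num_worker_replicas != 1
    if condition == ">= 1":
        return num_worker_replicas >= 1
    return num_worker_replicas > 1  # "> 1", the suggested condition

# For a single replica, "!= 1" and "> 1" skip the Worker spec (Master only),
# while ">= 1" still emits one; at two replicas all three agree.
one_replica = {c: adds_worker_spec(1, c) for c in ("!= 1", ">= 1", "> 1")}
two_replicas = {c: adds_worker_spec(2, c) for c in ("!= 1", ">= 1", "> 1")}
```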
If we set master_pod_template_spec to None and num_worker_replicas == 1, then we can create a PyTorchJob with 1 worker.
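A minimal sketch of that behaviour, assuming the ">= 1" condition from this PR; the helper and its return shape are illustrative, not the actual utils code:

```python
def build_replica_specs(num_worker_replicas, master_pod_template_spec=None):
    # When no master spec is passed, only a Worker spec is emitted, so a
    # num_worker_replicas of 1 yields a single-Worker PyTorchJob.
    specs = {}
    if master_pod_template_spec is not None:
        specs["Master"] = {"replicas": 1}
    if num_worker_replicas >= 1:
        specs["Worker"] = {"replicas": num_worker_replicas}
    return specs

single_worker = build_replica_specs(1, master_pod_template_spec=None)
```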
@johnugeorge @deepanker13 @tenzen-y Is it correct to start a PyTorchJob with one Worker when the user wants to run single-worker training? I was under the impression that if a PyTorchJob has 1 replica, we should set it under the Master spec.
I just started a PyTorchJob with a single Worker, and the labels for the pods are the following:

```
training.kubeflow.org/job-name=pytorch-simple
training.kubeflow.org/operator-name=pytorchjob-controller
training.kubeflow.org/replica-index=0
training.kubeflow.org/replica-type=worker
```
It should work. Did you see any issues with it?
Are these labels correct, @johnugeorge, for single-worker training? E.g. for TFJob, I can see that we attach the training.kubeflow.org/job-role=master label in that case:

```
training.kubeflow.org/job-name=tfjob-test
training.kubeflow.org/job-role=master
training.kubeflow.org/operator-name=tfjob-controller
training.kubeflow.org/replica-index=0
training.kubeflow.org/replica-type=worker
```

I think it is due to this: https://github.com/kubeflow/training-operator/blob/master/pkg/controller.v1/tensorflow/tfjob_controller.go#L611-L613
Should we have similar changes for all types of jobs, e.g. PyTorchJob, XGBoostJob, and PaddleJob?
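The TFJob behaviour referenced above can be sketched as follows. The actual controller is written in Go; this Python function and its parameter names are illustrative only, showing the rule that worker 0 additionally gets the job-role=master label when the job has no dedicated Master replica:

```python
def pod_labels(job_name, operator, replica_type, replica_index, has_master):
    # Base labels attached to every replica pod.
    labels = {
        "training.kubeflow.org/job-name": job_name,
        "training.kubeflow.org/operator-name": operator,
        "training.kubeflow.org/replica-type": replica_type,
        "training.kubeflow.org/replica-index": str(replica_index),
    }
    # With no dedicated Master replica, worker 0 is treated as the master.
    if not has_master and replica_type == "worker" and replica_index == 0:
        labels["training.kubeflow.org/job-role"] = "master"
    return labels

tf_labels = pod_labels("tfjob-test", "tfjob-controller", "worker", 0, False)
```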
@andreyvelich Yes, for now I am going to correct it in PyTorchJob.
/lgtm
Thanks @deepanker13!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, deepanker13

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
What this PR does / why we need it:
Changes needed for #1945

Which issue(s) this PR fixes (optional, in `Fixes #<issue number>, #<issue number>, ...` format, will close the issue(s) when PR gets merged):
Fixes #
The new train API depends on one new function (utils.get_container_spec) and two existing functions with some changes (utils.get_pod_template_spec and utils.get_pytorchjob_template).

A sample of the train API is shown in the first comment.

The train API will run in elastic mode; for elastic mode it is currently assumed that there will be 1 master replica.
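Since num_procs_per_worker is taken directly from the GPU request in the proposed API, that mapping can be shown on its own. The helper name and the fallback to 1 process for GPU-free requests are assumptions added here for illustration, not part of this PR:

```python
def procs_per_worker(resources_per_worker):
    # The proposed API passes resources_per_worker["gpu"] straight through
    # as num_procs_per_worker; defaulting to 1 when no GPUs are requested
    # is an assumption made for this sketch.
    gpus = int(resources_per_worker.get("gpu", 0))
    return gpus if gpus > 0 else 1

gpu_procs = procs_per_worker({"gpu": 2, "cpu": 4, "memory": "10Gi"})
cpu_procs = procs_per_worker({"gpu": 0, "cpu": 4, "memory": "10Gi"})
```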
Checklist: