Create TFJob and PyTorchJob from Function APIs in the Training SDK #1659

andreyvelich · 2022-09-12T23:54:30Z

Fixes: kubeflow/common#66.
Inspired by KFP create_component_from_func.

These APIs will allow users to create TFJob and PyTorchJob without building an image.
This is the first small step to simplify our Kubeflow SDKs and to avoid Kubernetes complexity.

Later we can extend this functionality (for example, to support other Job types) and expose more spec options via the APIs. Also, we might consider using one TrainingClient() (instead of a separate client for each Job) to reduce code and improve UX.

I used:

docker.io/pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime for the PyTorch base image.
docker.io/tensorflow/tensorflow:2.9.1 for the TensorFlow base image.
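
To illustrate the general idea of running a training function without building an image, here is a minimal sketch. It is not the SDK's actual implementation: the helper names `function_to_script`, `build_container_command`, and `command_from_function` are hypothetical, and the approach (serialize the function source, then pipe it to `python3` inside a stock base image) is an assumption modeled on the KFP `create_component_from_func` pattern mentioned above.

```python
import inspect
import shlex
import textwrap
from typing import Callable, List, Optional

# Base image named in this PR description.
PYTORCHJOB_BASE_IMAGE = "docker.io/pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime"


def function_to_script(func_source: str, func_name: str) -> str:
    """Return a standalone script that defines the function and then calls it."""
    # Dedent so that functions defined inside an indented block still parse.
    source = textwrap.dedent(func_source)
    return f"{source}\n{func_name}()\n"


def build_container_command(
    func_source: str,
    func_name: str,
    packages_to_install: Optional[List[str]] = None,
) -> List[str]:
    """Build a (hypothetical) container command that runs the function in a stock image."""
    script = function_to_script(func_source, func_name)
    shell_lines = []
    if packages_to_install:
        # Quote each package name so it is passed to pip as a single argument.
        quoted = " ".join(shlex.quote(pkg) for pkg in packages_to_install)
        shell_lines.append(f"python3 -m pip install --quiet {quoted}")
    # A quoted heredoc keeps the shell from expanding anything inside the script.
    shell_lines.append("python3 - <<'TRAIN_FUNC_EOF'\n" + script + "TRAIN_FUNC_EOF")
    return ["bash", "-c", "\n".join(shell_lines)]


def command_from_function(
    func: Callable[[], None],
    packages_to_install: Optional[List[str]] = None,
) -> List[str]:
    """Convenience wrapper: read the source of a function defined in a file or notebook."""
    return build_container_command(
        inspect.getsource(func), func.__name__, packages_to_install
    )
```

Under these assumptions, the returned list could be used as the `command` of a container that runs `PYTORCHJOB_BASE_IMAGE`, so the user only writes a Python function and never touches a Dockerfile. Note that `inspect.getsource` only works when the function's source is available, for example in a module file or a notebook cell.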

cc @kubeflow/wg-training-leads @tenzen-y @anencore94 @ca-scribner Please give your feedback on the API design.

review-notebook-app · 2022-09-12T23:54:34Z

Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks.

coveralls · 2022-09-12T23:58:16Z

Pull Request Test Coverage Report for Build 3068518611

0 of 0 changed or added relevant lines in 0 files are covered.
7 unchanged lines in 2 files lost coverage.
Overall coverage decreased (-0.07%) to 39.751%

Files with Coverage Reduction	New Missed Lines	%
pkg/controller.v1/pytorch/master.go	1	91.3%
pkg/controller.v1/pytorch/initcontainer.go	6	80.0%

Totals
Change from base Build 2973439690:	-0.07%
Covered Lines:	2327
Relevant Lines:	5854

💛 - Coveralls

andreyvelich · 2022-09-13T00:09:29Z

/hold for the review

terrytangyuan

Thanks!

/lgtm

tenzen-y

@andreyvelich Thanks for this awesome work, and sorry for the late review.
I left a few comments.

sdk/python/setup.py

sdk/python/kubeflow/training/api/py_torch_job_client.py

sdk/python/kubeflow/training/constants/constants.py

tenzen-y · 2022-09-14T17:32:01Z

sdk/python/kubeflow/training/constants/constants.py

-TFJOB_LOGLEVEL = os.environ.get('TFJOB_LOGLEVEL', 'INFO').upper()
+TFJOB_LOGLEVEL = os.environ.get("TFJOB_LOGLEVEL", "INFO").upper()
+
+TFJOB_BASE_IMAGE = "docker.io/tensorflow/tensorflow:2.9.1"


tenzen-y · Sep 14, 2022

It might be nice to use a default image with GPU support, the same as PYTORCHJOB_BASE_IMAGE.
WDYT?

andreyvelich · Sep 15, 2022

@tenzen-y I think the problem is that the TensorFlow GPU image is 2.2 GB larger than the CPU-only image: https://hub.docker.com/r/tensorflow/tensorflow/tags?page=1&name=2.9.1.
Maybe we can introduce TFJOB_BASE_IMAGE and TFJOB_BASE_IMAGE_GPU in our SDK. What do you think?

andreyvelich · Sep 15, 2022

What are your thoughts on that, @johnugeorge @tenzen-y @anencore94?

anencore94 · Sep 15, 2022

I also think introducing both CPU and GPU images is nice to have, but setting the CPU image as the default would be safe.

tenzen-y · Sep 15, 2022

sgtm

In that case, we can also introduce PYTORCHJOB_BASE_IMAGE and PYTORCHJOB_BASE_IMAGE_GPU.

andreyvelich · Sep 15, 2022

@tenzen-y I wasn't able to find an official PyTorch image with CPU support only: https://hub.docker.com/r/pytorch/pytorch/tags
Are you aware of any?

tenzen-y · Sep 16, 2022

Sorry, I assumed it existed...

andreyvelich · Sep 16, 2022

@tenzen-y @anencore94 I've added the TensorFlow GPU image to the constants.

tenzen-y · Sep 16, 2022

Thanks for updating!

anencore94 · Sep 16, 2022

Thanks a lot!
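
For readers following the outcome of this thread, the agreed layout (a CPU image as the safe default plus a separate GPU constant) might look roughly like the sketch below. The CPU and PyTorch image values are the ones quoted in this PR. The GPU tag `2.9.1-gpu` and the helper `select_tf_base_image` are assumptions made for illustration and are not confirmed by the thread.

```python
from typing import Final

# Values quoted in this PR.
TFJOB_BASE_IMAGE: Final[str] = "docker.io/tensorflow/tensorflow:2.9.1"
PYTORCHJOB_BASE_IMAGE: Final[str] = (
    "docker.io/pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime"
)

# Assumption: the thread does not state the exact GPU tag; "2.9.1-gpu" is illustrative.
TFJOB_BASE_IMAGE_GPU: Final[str] = "docker.io/tensorflow/tensorflow:2.9.1-gpu"


def select_tf_base_image(use_gpu: bool = False) -> str:
    """Hypothetical helper: default to the smaller CPU image unless a GPU is requested."""
    return TFJOB_BASE_IMAGE_GPU if use_gpu else TFJOB_BASE_IMAGE
```

Defaulting to the CPU image keeps the first pull small, which is the concern raised above about the GPU image being 2.2 GB larger; users who need a GPU opt in explicitly.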

sdk/python/kubeflow/training/utils/utils.py

sdk/python/kubeflow/training/constants/constants.py

sdk/python/kubeflow/training/api/py_torch_job_client.py

sdk/python/examples/create-pytorchjob-from-func.ipynb

Add Client info in example

johnugeorge · 2022-09-15T16:32:27Z

Thanks @andreyvelich for this work. This is a good story to start with.

/lgtm

johnugeorge · 2022-09-15T16:33:45Z

/hold for LGTM from @tenzen-y

tenzen-y

@andreyvelich Thanks for the awesome work!
/lgtm

google-oss-prow · 2022-09-16T00:25:13Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, tenzen-y, terrytangyuan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [terrytangyuan]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

johnugeorge · 2022-09-16T06:34:55Z

Ready to get merged

andreyvelich · 2022-09-16T14:21:16Z

@johnugeorge @tenzen-y @anencore94 thanks a lot for the review!
I believe all comments have been addressed.

If you are ok, we can merge this PR.

tenzen-y · 2022-09-16T14:24:47Z

> @johnugeorge @tenzen-y @anencore94 thanks a lot for the review! I believe all comments have been addressed.
>
> If you are ok, we can merge this PR.

I'm ok.
/lgtm

anencore94 · 2022-09-16T16:49:19Z

> @johnugeorge @tenzen-y @anencore94 thanks a lot for the review! I believe all comments have been addressed.
>
> If you are ok, we can merge this PR.

Thanks a lot for introducing such a useful feature!
/lgtm

andreyvelich · 2022-09-16T18:01:52Z

Thanks everyone!
/hold cancel

Create TFJob and PyTorchJob from Function APIs in the Training SDK

3f1e425

google-oss-prow bot added the size/XXL label Sep 12, 2022

google-oss-prow bot requested review from jinchihe and terrytangyuan September 12, 2022 23:54

andreyvelich mentioned this pull request Sep 13, 2022

Create Tune API in the Katib SDK kubeflow/katib#1951

Merged

google-oss-prow bot added the do-not-merge/hold label Sep 13, 2022

Install SDK in tests

f219d79

andreyvelich requested a review from a team September 14, 2022 11:34

terrytangyuan approved these changes Sep 14, 2022

google-oss-prow bot assigned terrytangyuan Sep 14, 2022

google-oss-prow bot added lgtm approved labels Sep 14, 2022

tenzen-y reviewed Sep 14, 2022

johnugeorge reviewed Sep 14, 2022

Use Kubeflow Group Const

b70eeaa

google-oss-prow bot removed the lgtm label Sep 15, 2022

Use Final for constants

cfed35d

Add Client info in example

google-oss-prow bot assigned johnugeorge Sep 15, 2022

google-oss-prow bot added the lgtm label Sep 15, 2022

Modify packages_to_install doc

488c710

google-oss-prow bot removed the lgtm label Sep 15, 2022

tenzen-y approved these changes Sep 16, 2022

google-oss-prow bot assigned tenzen-y Sep 16, 2022

google-oss-prow bot added the lgtm label Sep 16, 2022

Add Tensorflow GPU Image

2056231

google-oss-prow bot removed the lgtm label Sep 16, 2022

google-oss-prow bot added the lgtm label Sep 16, 2022

google-oss-prow bot assigned anencore94 Sep 16, 2022

google-oss-prow bot removed the do-not-merge/hold label Sep 16, 2022

google-oss-prow bot merged commit 8c9b33c into kubeflow:master Sep 16, 2022

andreyvelich deleted the update-pytorch-sdk-create-from-func branch September 16, 2022 18:02

andreyvelich mentioned this pull request Aug 2, 2023

[SDK] Create Job From Docker API #1878

Open

StefanoFioravanzo mentioned this pull request Mar 6, 2024

Fine-Tune APIs for LLM Documentation #2013

Closed

2 tasks
