-
Notifications
You must be signed in to change notification settings - Fork 700
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create TFJob and PyTorchJob from Function APIs in the Training SDK #1659
Create TFJob and PyTorchJob from Function APIs in the Training SDK #1659
Conversation
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
Pull Request Test Coverage Report for Build 3068518611
💛 - Coveralls |
/hold for the review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
/lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreyvelich Thanks for this awesome work, and sorry for the late review.
I left a few comments.
TFJOB_LOGLEVEL = os.environ.get('TFJOB_LOGLEVEL', 'INFO').upper() | ||
TFJOB_LOGLEVEL = os.environ.get("TFJOB_LOGLEVEL", "INFO").upper() | ||
|
||
TFJOB_BASE_IMAGE = "docker.io/tensorflow/tensorflow:2.9.1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be nice to use the default image with GPU support same as PYTORCHJOB_BASE_IMAGE
.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y I think the problem is Tensorflow GPU image is 2.2Gb more than image with CPU support: https://hub.docker.com/r/tensorflow/tensorflow/tags?page=1&name=2.9.1.
Maybe we can introduce TFJOB_BASE_IMAGE
and TFJOB_BASE_IMAGE_GPU
in our SDK, what do you think ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are your thoughts on that @johnugeorge @tenzen-y @anencore94 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think introducing both cpu,gpu is nice to have, but set the cpu image as the default image would be safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sgtm
In that case, we also can introduce PYTORCHJOB_BASE_IMAGE
and PYTORCHJOB_BASE_IMAGE_GPU
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y I wasn't able to find official PyTorch image with CPU support only: https://hub.docker.com/r/pytorch/pytorch/tags
Are you aware of those ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I assumed it existed...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y @anencore94 I've added the Tensorflow GPU Image in the constants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for updating!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot !
Add Client info in example
Thanks @andreyvelich for this work. This is a good story to start. /lgtm |
/hold for LGTM from @tenzen-y |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreyvelich Thanks for the awesome work!
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andreyvelich, tenzen-y, terrytangyuan The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Ready to get merged |
@johnugeorge @tenzen-y @anencore94 thanks a lot for the review! If you are ok, we can merge this PR. |
I'm ok. |
Thaks a lot for introducing such a useful feature! |
Thanks everyone! |
Fixes: kubeflow/common#66.
Inspired by KFP
create_component_from_func
.These APIs will allow user to create
TFJob
andPyTorchJob
without building the image.This is the first small step to simplify our Kubeflow SDKs and to avoid Kubernetes complexity.
Later we can extend this functionality (support other Job types), give more spec options via APIs. Also, we might consider to use one
TrainingClient()
(instead of separate client for each Job) to reduce code and improve UX.I used:
docker.io/pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime
for PyTorch base image.docker.io/tensorflow/tensorflow:2.9.1
for Tensorflow base image.cc @kubeflow/wg-training-leads @tenzen-y @anencore94 @ca-scribner Please give your feedback on the API design.