[SDK] Consolidate Naming for CRUD APIs #1907
I updated SDK examples and I moved them to |
/hold for review
4adf8e4 to 19bb327
Basically, LGTM
def test_pytorchjob_from_func(job_namespace):
    # Test Training function.
    def train_func(parameters):
        import pandas as pd
        import time

        print(f"Package pandas=={pd.__version__} is installed")
        print(f"Input function parameters are: {parameters}")

        print("Stat Training ....")
        for i in range(10):
            print(f"Epoch: {i} finished")
            time.sleep(1)

        print("Training is complete")

    TRAINING_CLIENT.create_job(
        name=JOB_NAME,
        namespace=job_namespace,
        parameters={"lr": "0.01"},
        train_func=train_func,
        num_worker_replicas=1,
        packages_to_install=["pandas==1.3.5"],
    )

    TRAINING_CLIENT.delete_pytorchjob(JOB_NAME, job_namespace)
    logging.info("Get created PyTorchJob from function")
    logging.info(TRAINING_CLIENT.get_job(JOB_NAME, job_namespace))

    verify_job_e2e(TRAINING_CLIENT, JOB_NAME, job_namespace, timeout=900)
    TRAINING_CLIENT.delete_job(JOB_NAME, job_namespace)
Hmm, I think this test is for the SDK, not the operator, so ideally it belongs in the unit tests.
However, I'm not sure whether we can move it there.
@andreyvelich WDYT?
We might need to test this functionality in E2Es as well, unless our unit tests are going to run a Kubernetes cluster.
I agree that we need unit tests for our SDK, but that is a separate discussion.
With this test I want to verify that:
- Kubernetes can properly create containers when the train function is embedded in the container args.
- Packages can be downloaded after the container has started.
I am not sure how we can verify this in unit tests unless we start a Kubernetes cluster during them.
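To make the two checks above concrete, here is a minimal sketch of one way a training function could be embedded in a container argument. It assumes the function source is extracted with `inspect.getsource` and written to a file by a shell script at container start. The helper names `render_script` and `build_from_func`, the `/tmp/train.py` path, and the command layout are all hypothetical and are not taken from the SDK's actual implementation.

```python
import inspect
import textwrap


def render_script(func_source, func_name, parameters, packages_to_install=None):
    """Hypothetical helper: build a container command and args from source."""
    # Remove the indentation the function had inside its enclosing scope,
    # so it becomes valid top-level code.
    func_code = textwrap.dedent(func_source)
    # Append a call so the function runs when the script is executed.
    func_code += f"\n{func_name}({parameters!r})\n"

    script = ""
    if packages_to_install:
        # Packages are installed after the container starts, before training.
        script += "pip install --quiet " + " ".join(packages_to_install) + "\n"
    # The quoted heredoc keeps the shell from expanding anything in the code.
    script += f"cat > /tmp/train.py <<'EOF'\n{func_code}EOF\n"
    script += "python /tmp/train.py\n"
    return ["bash", "-c"], [script]


def build_from_func(train_func, parameters, packages_to_install=None):
    """Hypothetical wrapper: works only for functions defined in a file."""
    source = inspect.getsource(train_func)
    return render_script(source, train_func.__name__, parameters, packages_to_install)
```

Building the string can be checked without a cluster, but whether the container actually starts and `pip` can reach a package index can only be observed in a running cluster, which is the point made above.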
I was thinking that it might be better to run this test case as a unit test if there were a fake client for Python, like the one in client-go.
However, I could not find such a fake client for Python, so we should keep this case in e2e.
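For reference, one possible substitute for a fake client in Python is `unittest.mock.MagicMock`. The sketch below assumes the SDK call path ends in `create_namespaced_custom_object` on a Kubernetes custom objects API object. The wrapper `create_pytorchjob` is hypothetical and exists only for illustration.

```python
from unittest.mock import MagicMock


def create_pytorchjob(custom_api, job, namespace):
    """Hypothetical thin wrapper around a custom objects API instance."""
    return custom_api.create_namespaced_custom_object(
        group="kubeflow.org",  # API group of the PyTorchJob resource
        version="v1",
        namespace=namespace,
        plural="pytorchjobs",
        body=job,
    )


def test_create_pytorchjob_calls_api():
    # MagicMock stands in for the API object, so no cluster is needed.
    fake_api = MagicMock()
    job = {"kind": "PyTorchJob", "metadata": {"name": "demo"}}

    create_pytorchjob(fake_api, job, "default")

    # Only the shape of the request is verified here.
    fake_api.create_namespaced_custom_object.assert_called_once_with(
        group="kubeflow.org",
        version="v1",
        namespace="default",
        plural="pytorchjobs",
        body=job,
    )
```

A mock like this checks only that the request is built as expected. It cannot show that containers start or that packages install, so it would complement the e2e case rather than replace it.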
Let's explore such options when we have time to add unit tests for our SDK client.
SGTM
df3d177 to fea6efb
@tenzen-y I removed the test for creating a PyTorchJob from a function from our E2Es for now. For tests, I print job info when |
fea6efb to 2f49954
60cf3e6 to ca02651
@tenzen-y Alright, the test passed. I think it was a temporary issue with MXNet dataset downloading.
@tenzen-y Please let me know if there are any other comments that you want me to address.
It will take a bit of time to review this PR again since I'm backlogged on PRs in multiple repositories.
Sure, no problem. Thank you for your time!
/lgtm
/assign @tenzen-y
@tenzen-y If you have a few minutes to check the final changes, that would be great!
@andreyvelich Sorry for the late reply. The implementation looks good to me. However, I'm using the SDK on my local machine to confirm why the following e2e fails, since I hit a similar error locally.
|
@tenzen-y Could you please show what error you got?
Once I re-created my local cluster, the error went away. Sorry for the confusion.
Thank you!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: andreyvelich, tenzen-y
@tenzen-y During our E2E testing I saw that sometimes |
Or maybe we want to load images into the kind cluster with |
Fixes: #1877
Related: #1878
I unified the CRUD APIs in our Training Operator SDK, so users can submit different Jobs much more easily with the job_kind parameter.
Users can configure namespace and job_kind in TrainingClient() once, so they can re-use them during API execution.
Here is the list of public APIs that we expose to the user:
Create Job API now supports:
- job
- base_image (currently only for TFJob and PyTorchJob).
- train_func (currently only for TFJob and PyTorchJob).
I removed the yapf and pylint configs, since we can use the black + flake8 combination (KFP also uses flake8 for lint checks).
It would be nice to add more unit tests and lint checks for our SDK.
TODO: I still need to update the examples with the new APIs in this PR.
/hold
It would be nice if you could start reviewing the SDK changes.
/assign @kubeflow/wg-training-leads @tenzen-y @kuizhiqing @yaobaiwei @zw0610
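The idea of configuring namespace and job_kind once and re-using them across calls can be sketched as follows. This is an illustrative in-memory stand-in, not the real TrainingClient: the class name `SketchTrainingClient`, the `SUPPORTED_KINDS` subset, and the dictionary store are assumptions made for the example.

```python
class SketchTrainingClient:
    """Illustrative stand-in; not the real TrainingClient implementation."""

    # Assumed subset of job kinds, for demonstration only.
    SUPPORTED_KINDS = ("PyTorchJob", "TFJob")

    def __init__(self, namespace="default", job_kind="PyTorchJob"):
        self.namespace = namespace
        self.job_kind = job_kind
        self._store = {}  # in-memory replacement for the cluster

    def _resolve(self, namespace=None, job_kind=None):
        # Per-call values win; otherwise fall back to the client defaults.
        namespace = namespace or self.namespace
        job_kind = job_kind or self.job_kind
        if job_kind not in self.SUPPORTED_KINDS:
            raise ValueError(f"Unsupported job kind: {job_kind}")
        return namespace, job_kind

    def create_job(self, name, namespace=None, job_kind=None):
        namespace, job_kind = self._resolve(namespace, job_kind)
        job = {"kind": job_kind, "metadata": {"name": name, "namespace": namespace}}
        self._store[(namespace, job_kind, name)] = job
        return job

    def get_job(self, name, namespace=None, job_kind=None):
        namespace, job_kind = self._resolve(namespace, job_kind)
        return self._store[(namespace, job_kind, name)]

    def delete_job(self, name, namespace=None, job_kind=None):
        namespace, job_kind = self._resolve(namespace, job_kind)
        del self._store[(namespace, job_kind, name)]


# Defaults are set once, then overridden only where needed.
client = SketchTrainingClient(namespace="team-a", job_kind="TFJob")
client.create_job("mnist")
client.create_job("bert", job_kind="PyTorchJob")
```

With this shape, a single set of create, get, and delete methods covers every job kind, and callers who work in one namespace with one kind never repeat those arguments.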