[SDK] Consolidate Naming for CRUD APIs #1907

andreyvelich · 2023-09-07T19:26:51Z

Fixes: #1877
Related: #1878

I unified the CRUD APIs in our Training Operator SDK, so users can submit different Job kinds much more easily with the job_kind parameter.
Users can configure namespace and job_kind once in TrainingClient() and reuse them across API calls.

Here is the list of public APIs that we expose to the user:

create_job()
get_job()
get_job_conditions()
get_job_logs()
get_job_pod_names()
is_job_created()
is_job_failed()
is_job_restarting()
is_job_running()
is_job_succeeded()
wait_for_job_conditions()
list_jobs()
update_job()
delete_job()
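
For readers who want to see the "configure once, reuse per call" idea in isolation, here is a minimal sketch. It is not the SDK's implementation: `SketchClient`, its in-memory store, and its default values are hypothetical and only mirror the method names listed above.

```python
from typing import Dict, List, Optional, Tuple


class SketchClient:
    """Hypothetical stand-in for a client that stores defaults once."""

    def __init__(self, namespace: str = "default", job_kind: str = "PyTorchJob") -> None:
        # Defaults are set once here and reused by every method below.
        self.namespace = namespace
        self.job_kind = job_kind
        # In-memory store keyed by (namespace, kind, name). A real client
        # would call the Kubernetes API instead (assumption of this sketch).
        self._jobs: Dict[Tuple[str, str, str], dict] = {}

    def _key(self, name: str, namespace: Optional[str], job_kind: Optional[str]) -> Tuple[str, str, str]:
        # A per-call argument overrides the constructor default.
        return (namespace or self.namespace, job_kind or self.job_kind, name)

    def create_job(self, name: str, namespace: Optional[str] = None, job_kind: Optional[str] = None) -> dict:
        key = self._key(name, namespace, job_kind)
        if key in self._jobs:
            raise ValueError(f"Job already exists: {key}")
        self._jobs[key] = {"name": name, "namespace": key[0], "kind": key[1]}
        return self._jobs[key]

    def get_job(self, name: str, namespace: Optional[str] = None, job_kind: Optional[str] = None) -> dict:
        return self._jobs[self._key(name, namespace, job_kind)]

    def list_jobs(self, namespace: Optional[str] = None, job_kind: Optional[str] = None) -> List[dict]:
        ns, kind, _ = self._key("", namespace, job_kind)
        return [job for (job_ns, job_kind_, _), job in self._jobs.items()
                if job_ns == ns and job_kind_ == kind]

    def delete_job(self, name: str, namespace: Optional[str] = None, job_kind: Optional[str] = None) -> None:
        del self._jobs[self._key(name, namespace, job_kind)]
```

With `SketchClient(namespace="team-a", job_kind="TFJob")`, a call such as `create_job("mnist")` uses both defaults, while `create_job("bert", job_kind="PyTorchJob")` overrides only the kind for that one call.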

Create Job API now supports:

Create Job from object - job
Create Job from Docker image - base_image (Currently, only for TFJob and PyTorchJob).
Create Job from function - train_func (Currently, only for TFJob and PyTorchJob).
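
The three creation modes above imply some input validation, for example that a Job object is not combined with the other inputs. The following is a rough sketch of such a check, not the code in this PR. The function name `resolve_create_mode`, the returned mode strings, and the choice to let `base_image` accompany `train_func` are assumptions made for illustration.

```python
from typing import Any, Callable, Optional

# Kinds that accept base_image / train_func, per the description above.
TEMPLATE_KINDS = {"TFJob", "PyTorchJob"}


def resolve_create_mode(
    job_kind: str,
    job: Optional[Any] = None,
    base_image: Optional[str] = None,
    train_func: Optional[Callable] = None,
) -> str:
    """Return "object", "image", or "function" (hypothetical labels)."""
    if job is not None:
        # A full Job object must not be mixed with the other inputs.
        if base_image is not None or train_func is not None:
            raise ValueError("job cannot be combined with base_image or train_func")
        return "object"
    if base_image is None and train_func is None:
        raise ValueError("one of job, base_image, or train_func is required")
    if job_kind not in TEMPLATE_KINDS:
        raise ValueError(f"base_image and train_func are not supported for {job_kind}")
    # Assumption: when both are given, base_image is the image that runs train_func.
    return "function" if train_func is not None else "image"
```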

I removed the yapf and pylint configs, since we can use the black + flake8 combination (KFP also uses flake8 for lint checks).
It would be nice to add more unit tests and lint checks for our SDK.

~~TODO: I still need to update examples with the new APIs in this PR.~~
/hold

It would be nice if you could start reviewing the SDK changes.
/assign @kubeflow/wg-training-leads @tenzen-y @kuizhiqing @yaobaiwei @zw0610

review-notebook-app · 2023-09-08T21:49:06Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

andreyvelich · 2023-09-08T22:06:24Z

I updated SDK examples and I moved them to /examples/sdk, so users can find them easily.
Please take a look at the changes once you have time.

andreyvelich · 2023-09-08T22:07:34Z

/hold for review

coveralls · 2023-09-08T22:19:03Z

Pull Request Test Coverage Report for Build 6206887875

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage remained the same at 42.777%

Totals
Change from base Build 6204968357:	0.0%
Covered Lines:	3737
Relevant Lines:	8736

💛 - Coveralls

tenzen-y

Basically, LGTM

.github/workflows/test-python.yaml

.pylintrc

docs/development/developer_guide.md

sdk/python/kubeflow/training/api/training_client.py

sdk/python/kubeflow/training/utils/utils.py

sdk/python/test/e2e/test_e2e_mpijob.py

tenzen-y · 2023-09-13T07:47:20Z

sdk/python/test/e2e/test_e2e_pytorchjob.py

+def test_pytorchjob_from_func(job_namespace):
+    # Test Training function.
+    def train_func(parameters):
+        import pandas as pd
+        import time
+
+        print(f"Package pandas=={pd.__version__} is installed")
+        print(f"Input function parameters are: {parameters}")
+
+        print("Stat Training ....")
+        for i in range(10):
+            print(f"Epoch: {i} finished")
+            time.sleep(1)
+
+        print("Training is complete")
+
+    TRAINING_CLIENT.create_job(
+        name=JOB_NAME,
+        namespace=job_namespace,
+        parameters={"lr": "0.01"},
+        train_func=train_func,
+        num_worker_replicas=1,
+        packages_to_install=["pandas==1.3.5"],
    )

-    TRAINING_CLIENT.delete_pytorchjob(JOB_NAME, job_namespace)
+    logging.info("Get created PyTorchJob from function")
+    logging.info(TRAINING_CLIENT.get_job(JOB_NAME, job_namespace))
+
+    verify_job_e2e(TRAINING_CLIENT, JOB_NAME, job_namespace, timeout=900)
+    TRAINING_CLIENT.delete_job(JOB_NAME, job_namespace)


Hmm, I think this test is for the SDK, not the operator. So ideally, we should put this test in the unit tests.
However, I'm not sure whether we can run this test as a unit test.
@andreyvelich WDYT?

We might need to test this functionality in E2Es as well, unless our unit tests are going to run a Kubernetes cluster.
I agree that we need to have unit tests for our SDK, but that is a separate discussion.

For this test, I want to verify that:

Kubernetes can properly create containers when the training function is embedded in the container args.

Packages can be downloaded after the container has started.

I am not sure how we can verify this in unit tests, unless we start a Kubernetes cluster during unit tests.
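
To make the two points above concrete, here is a minimal sketch of how a training function's source could be embedded in container arguments together with a package install step. It is an illustration under stated assumptions, not the SDK's code: `build_container_command` is a hypothetical helper, it takes the function source as text (in a real file this could come from `inspect.getsource`), and the `bash -c` / `python3 -m pip` layout is an assumption.

```python
import shlex
import textwrap
from typing import Dict, List, Optional


def build_container_command(
    func_source: str,
    func_name: str,
    parameters: Optional[Dict[str, str]] = None,
    packages: Optional[List[str]] = None,
) -> List[str]:
    """Build a command that installs packages, then runs the function."""
    # Dedent so a nested function definition becomes valid top-level code.
    code = textwrap.dedent(func_source)
    # Append the call; repr() is enough for simple literal parameters.
    call_args = "" if parameters is None else repr(parameters)
    code += f"\n{func_name}({call_args})\n"

    script = ""
    if packages:
        # This install step runs inside the container, after it has started.
        quoted = " ".join(shlex.quote(p) for p in packages)
        script += f"python3 -m pip install --quiet {quoted}\n"
    # shlex.quote keeps quotes and newlines in the source intact for the shell.
    script += f"python3 -c {shlex.quote(code)}\n"
    return ["bash", "-c", script]
```

Whether the container actually starts and whether the install step can reach a package index are exactly the parts that only a running cluster can confirm, which is the concern raised above.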

I was thinking that it might be better to run this test case as a unit test if there is a fake client for Python like the one in client-go.

However, I could not find such a fake client for Python. So we should keep this case in e2e.

Let's explore such options when we have time to add unit tests for our SDK client.
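
As one possible starting point for that exploration, the standard library's `unittest.mock` can stand in for the API object. This is only a sketch: `submit_pytorchjob` is a hypothetical helper, and the group, version, and plural values are assumptions. The mocked method name follows `create_namespaced_custom_object` from the Kubernetes Python client's `CustomObjectsApi`. Note that a mock only checks the shape of the call, so it would not replace the E2E checks discussed above.

```python
from unittest import mock

# Assumed API coordinates for the custom resource.
GROUP, VERSION, PLURAL = "kubeflow.org", "v1", "pytorchjobs"


def submit_pytorchjob(api, name: str, namespace: str) -> dict:
    """Hypothetical helper: build a minimal body and submit it via `api`."""
    body = {
        "apiVersion": f"{GROUP}/{VERSION}",
        "kind": "PyTorchJob",
        "metadata": {"name": name, "namespace": namespace},
    }
    api.create_namespaced_custom_object(GROUP, VERSION, namespace, PLURAL, body)
    return body


def test_submit_pytorchjob_calls_api_once() -> None:
    # MagicMock records calls, so no cluster is needed.
    fake_api = mock.MagicMock()
    body = submit_pytorchjob(fake_api, "demo", "team-a")
    fake_api.create_namespaced_custom_object.assert_called_once_with(
        GROUP, VERSION, "team-a", PLURAL, body
    )
```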

andreyvelich · 2023-09-14T23:23:42Z

@tenzen-y I removed the test for creating a PyTorchJob from a function from our E2E for now.
I am not sure why the tests failed, since everything works fine locally. So we can unblock this PR.
We can work on adding tests for such use cases later.

For the tests, I now print the Job info when verify_e2e fails, so we can see some details about failed tests.
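
The "print Job info on failure" idea can be sketched as a small polling helper. This is not the helper used in the E2E tests: `wait_or_log` and its callable parameters are hypothetical, chosen so the sketch runs without a cluster.

```python
import logging
import time
from typing import Callable


def wait_or_log(
    get_job: Callable[[], dict],
    is_succeeded: Callable[[dict], bool],
    wait_timeout: float = 600.0,
    polling_interval: float = 1.0,
) -> dict:
    """Poll until the Job succeeds; log the last seen Job on timeout."""
    deadline = time.monotonic() + wait_timeout
    job = get_job()
    while not is_succeeded(job):
        if time.monotonic() >= deadline:
            # Dump the last observed Job so a failed CI run has details.
            logging.error("Job did not succeed within %ss: %s", wait_timeout, job)
            raise TimeoutError(f"Job did not succeed within {wait_timeout}s")
        time.sleep(polling_interval)
        job = get_job()
    return job
```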

andreyvelich · 2023-09-16T11:27:00Z

@tenzen-y Alright, the tests passed. I think it was a temporary issue with downloading the MXNet dataset.

andreyvelich · 2023-09-18T11:09:47Z

@tenzen-y Please let me know if there are any other comments that you want me to address.
/hold cancel

tenzen-y · 2023-09-18T14:45:37Z

> @tenzen-y Please let me know if there are any other comments that you want me to address. /hold cancel

It will take a bit of time to review this PR again since I'm backlogged on PRs in multiple repositories.
Thanks for your patience.

andreyvelich · 2023-09-18T14:55:05Z

> > @tenzen-y Please let me know if there are any other comments that you want me to address. /hold cancel
>
> It will take a bit of time to review this PR again since I'm backlogged on PRs in multiple repositories. Thanks for your patience.

Sure, no problem. Thank you for your time!

johnugeorge · 2023-09-20T19:02:18Z

/lgtm

johnugeorge · 2023-09-21T04:57:16Z

/assign @tenzen-y

andreyvelich · 2023-09-22T12:52:44Z

@tenzen-y If you have a few minutes to check the final changes, that would be great!

tenzen-y · 2023-09-22T13:00:15Z

> @tenzen-y If you have a few minutes to check the final changes, that would be great!

@andreyvelich Sorry for the delay. The implementation looks good to me. However, I am using the SDK locally to confirm why the following e2e test fails, since I hit a similar error locally.

> @tenzen-y I removed the test for creating a PyTorchJob from a function from our E2E for now.
> I am not sure why the tests failed, since everything works fine locally. So we can unblock this PR.
> We can work on adding tests for such use cases later.

andreyvelich · 2023-09-22T13:57:20Z

> However, I am using the SDK locally to confirm why the following e2e test fails, since I hit a similar error locally.

@tenzen-y Could you please show which error you got?

tenzen-y · 2023-09-22T14:24:26Z

> > However, I am using the SDK locally to confirm why the following e2e test fails, since I hit a similar error locally.
>
> @tenzen-y Could you please show which error you got?

Once I re-created my local cluster, the error went away. Sorry for the confusion.

tenzen-y

Thank you!
/lgtm
/approve

google-oss-prow · 2023-09-22T14:26:34Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [tenzen-y]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

andreyvelich · 2023-09-22T14:42:18Z

> Once I re-created my local cluster, the error went away. Sorry for the confusion.

@tenzen-y During our E2E testing I saw that sometimes kind just can't pull the image. Maybe we should consider using minikube, similar to the Katib E2Es.

tenzen-y · 2023-09-22T14:46:14Z

> > Once I re-created my local cluster, the error went away. Sorry for the confusion.
>
> @tenzen-y During our E2E testing I saw that sometimes kind just can't pull the image. Maybe we should consider using minikube, similar to the Katib E2Es.

Or we may want to load the images into the kind cluster with kind load docker-image XXX before starting the tests.
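
A pre-load step along those lines could look roughly like the sketch below. It only assembles and runs `kind load docker-image` commands. The helper names, the optional `--name` cluster argument, and the injectable `run` parameter are assumptions for illustration, and the image names are left to the caller.

```python
import subprocess
from typing import Callable, Iterable, List, Optional


def kind_load_command(image: str, cluster_name: Optional[str] = None) -> List[str]:
    """Build the `kind load docker-image` command for one image."""
    cmd = ["kind", "load", "docker-image", image]
    if cluster_name:
        # Target a specific kind cluster instead of the default one.
        cmd += ["--name", cluster_name]
    return cmd


def preload_images(
    images: Iterable[str],
    cluster_name: Optional[str] = None,
    run: Callable = subprocess.run,
) -> None:
    """Load each image into the cluster before the tests start."""
    for image in images:
        # check=True makes a failed load stop the run early.
        run(kind_load_command(image, cluster_name), check=True)
```

Passing a fake `run` callable makes the helper testable without kind installed.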

google-oss-prow bot added do-not-merge/work-in-progress do-not-merge/hold labels Sep 7, 2023

google-oss-prow bot requested review from jinchihe and kuizhiqing September 7, 2023 19:26

google-oss-prow bot added the size/XXL label Sep 7, 2023

tenzen-y mentioned this pull request Sep 8, 2023

Removing assignment of service-account for launcher #1898

Closed

1 task

andreyvelich changed the title ~~[WIP] [SDK] Consolidate Naming for CRUD APIs~~ [SDK] Consolidate Naming for CRUD APIs Sep 8, 2023

google-oss-prow bot removed the do-not-merge/work-in-progress label Sep 8, 2023

andreyvelich force-pushed the issue-1877-consolidate-sdk-apis branch from 4adf8e4 to 19bb327 Compare September 8, 2023 22:13

tenzen-y reviewed Sep 12, 2023

This was referenced Sep 12, 2023

Add Linter and Formatter for Python Files in CI/CD #1910

Closed

[SDK] Create Job from Template for all Job Kinds #1911

Open

tenzen-y reviewed Sep 13, 2023

andreyvelich force-pushed the issue-1877-consolidate-sdk-apis branch 2 times, most recently from df3d177 to fea6efb Compare September 14, 2023 23:20

andreyvelich added 10 commits September 16, 2023 09:30

Add Flake and Black Lint

cce11c1

Change SDK APIs

b2c7385

Update E2E tests

962fe00

Fix a few function parameters

b5875dc

Fix black format

670b3ff

Fix a few comments

fd84445

Fix conftest location

686cba5

Fix Job kind in tests

5c8f9e1

Fix client creation in test

7da001b

Fix namespace arg in get_job_conditions

4524f35

andreyvelich added 6 commits September 16, 2023 09:30

Rename timeout to wait_timeout

6ac19c2

Validate that Job is not set with other input parameters

e78dba2

Update black in developer guide

fe00a73

Remove pip_index_url validation

1f26015

Use locals to verify input

a9e3774

Print Job info when E2E fails

2f49954

andreyvelich force-pushed the issue-1877-consolidate-sdk-apis branch from fea6efb to 2f49954 Compare September 16, 2023 08:30

Remove duplicated delete

ca02651

andreyvelich force-pushed the issue-1877-consolidate-sdk-apis branch from 60cf3e6 to ca02651 Compare September 16, 2023 10:43

google-oss-prow bot removed the do-not-merge/hold label Sep 18, 2023

google-oss-prow bot assigned johnugeorge Sep 20, 2023

google-oss-prow bot added the lgtm label Sep 20, 2023

google-oss-prow bot assigned tenzen-y Sep 21, 2023

tenzen-y approved these changes Sep 22, 2023

google-oss-prow bot added the approved label Sep 22, 2023

google-oss-prow bot merged commit bb2b58a into kubeflow:master Sep 22, 2023
57 checks passed

andreyvelich deleted the issue-1877-consolidate-sdk-apis branch September 22, 2023 14:42
