This repository has been archived by the owner on Sep 12, 2023. It is now read-only.

Unified training operator working progress #138

Open
Jeffwan opened this issue Jun 26, 2021 · 7 comments

@Jeffwan
Member

Jeffwan commented Jun 26, 2021

@zw0610 and I presented the all-in-one training operator proposal at last month's community meeting.

The WG-Training leads have already agreed to move forward. This issue tracks implementation progress. The target for the alpha release of this new unified operator is Kubeflow 1.4.

Configuration and deployment

| Description | Category | Status | Issue |
| --- | --- | --- | --- |
| Kustomize package | Required | Done | |
| Application CR | Required | Not Done | |
| Images listed in kustomization.yaml | Required | Not Done | |
| Upgradeability | Required | Not Done | |
| Separate cluster scoped and namespace scoped resources | Recommended | Not Done | N/A |
| Kustomize package should be deployable on its own | Recommended | Done | Need to coordinate with 1.4 release |

Custom Resources

| Description | Category | Status | Issue |
| --- | --- | --- | --- |
| Version stability | Required | Not Done | |
| Backward compatibility | Required | Not Done | |
| Supports status subresource | Required | Done | All jobs have status to reflect the real status |
| CRD schema validation | Required | Not Done | |
| Training operators follow kubeflow/common conventions | Required | Done | kubeflow/training-operator#1296 kubeflow/training-operator#1295 kubeflow/training-operator#1294 kubeflow/training-operator#1293 |

Observability

| Description | Category | Status | Issue |
| --- | --- | --- | --- |
| Liveness/Readiness signals | Required | Not Done | |
| Prometheus metrics and Graphs | Required | Not Done | |
| Job Events | Required | Not Done | |
| JSON logging | Recommended | Not Done | |

CI/CD

| Description | Category | Status | Issue |
| --- | --- | --- | --- |
| E2E tests | Required | Not Done | |
| Scalability / load testing | Required | Not Done | |
| Continuous building of docker images | Recommended | Not Done | kubeflow/testing#951 |
| Continuous updating of Kustomize manifests | Recommended | Not Done | This is not valid anymore - kubeflow/manifests will fetch repo's kustomize manifest |

Docs

| Description | Category | Status | Issue |
| --- | --- | --- | --- |
| API Reference docs | Required | Not Done | |
| Application docs | Required | Not Done | |

Owners/Maintenance

| Description | Category | Explanation | Status | Issue |
| --- | --- | --- | --- | --- |
| Healthy number of committers and commits | Required | Committers are listed as approvers in owners files. Number to be determined by TOC based on size and scope of application | Not Done | |
| At least 2 different organizations are committers | Required | | Not Done | |

Adoption

| Description | Category | Explanation | Status | Issue |
| --- | --- | --- | --- | --- |
| List of users running the application | Recommended | Suggest listing adopters willing to be identified publicly in ADOPTERS.md | Not Done | |

@Jeffwan
Member Author

Jeffwan commented Jun 27, 2021

Things to figure out:

  1. Code repo process and project name -> confirm with Bobby.
  2. Tech stack: Kubebuilder version, Kubernetes version, etc.
  3. Integration environments: Prow or GitHub Actions? Where to host the images? (Andrey)
  4. API version management and clientset generation.
  5. Development cycle.

@Jeffwan
Member Author

Jeffwan commented Jul 5, 2021

An update on the above items. @zw0610 @kubeflow/wg-training-leads

  1. code repo process, project name -> confirm with Boggy.

Reuse tf-operator and rename it to kubeflow/training-operator, pending confirmation with Boggy.

All issues, commits, followers, and stars will be transferred to the new repo.

  2. tech stack? Kubebuilder version, Kubernetes version etc

Kubernetes 1.19.x, Kubebuilder 3.0.0, controller-runtime v0.7.2.

  3. integration environments - Prow or Github Actions, Where to hold the images?

Reuse our Prow test jobs in the all-in-one-operator branch. Use AWS public images and CD for the short term.

  4. API version management & clientset generation

Start from the v1 API, since we plan to reuse most of the existing specs in phase 1.
Client generation will be postponed until we see other repos that want to use it.

  5. Development cycle

Use a separate develop branch in tf-operator (July 16) -> when all features are ready, merge back to master (2 weeks of review by training leads) -> clean up the code base (1 week) -> rename the repo (1 month, in time to catch the 1.4 release).

We plan to have an alpha RC release by the Training & AutoML summit (July 16).
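
To illustrate item 4 (starting from the v1 API and reusing the existing specs), here is a minimal Python sketch of what "one operator, several job kinds under `kubeflow.org/v1`" could look like from a user's side. The helper `build_job`, the `JOB_KINDS` table, and the image name are hypothetical and only for illustration; they are not part of any Kubeflow SDK. The per-kind field names (`tfReplicaSpecs`, `pytorchReplicaSpecs`) and container names (`tensorflow`, `pytorch`) follow the existing TFJob and PyTorchJob specs.

```python
# Hypothetical helper for illustration only; not part of any Kubeflow SDK.
API_VERSION = "kubeflow.org/v1"

# For each job kind: the replica-spec field in the existing spec and the
# container name that the existing operator expects.
JOB_KINDS = {
    "TFJob": ("tfReplicaSpecs", "tensorflow"),
    "PyTorchJob": ("pytorchReplicaSpecs", "pytorch"),
}


def build_job(kind, name, image, replicas, namespace="default"):
    """Build a manifest dict for one training job.

    `replicas` maps a replica type (e.g. "Worker", "Master") to its count.
    """
    replica_field, container_name = JOB_KINDS[kind]
    replica_specs = {
        replica_type: {
            "replicas": count,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {
                    "containers": [{"name": container_name, "image": image}]
                }
            },
        }
        for replica_type, count in replicas.items()
    }
    return {
        "apiVersion": API_VERSION,
        "kind": kind,
        "metadata": {"name": name, "namespace": namespace},
        "spec": {replica_field: replica_specs},
    }


# Two job kinds, same API group and version (image name is a placeholder).
tf_job = build_job("TFJob", "mnist-tf", "example.com/mnist:latest", {"Worker": 2})
pt_job = build_job(
    "PyTorchJob", "mnist-pt", "example.com/mnist:latest", {"Master": 1, "Worker": 2}
)
```

Both manifests share the same `apiVersion` and differ only in `kind` and the replica-spec field, which is the property that lets phase 1 reuse the existing specs unchanged.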

@andreyvelich
Member

Thank you for driving this @Jeffwan!

> Kubernetes 1.19.x, Kubebuilder 3.0.0, controller-runtime v0.7.2

Is there a reason we need to use Kubernetes 1.19? Can we jump to 1.20, or even to the latest 1.21?

> Client generation will be postponed until we see other repos that want to use it.

Does it mean that we also drop SDK support? Or are we talking only about clientsets, listers, and informers?

@Jeffwan
Member Author

Jeffwan commented Jul 6, 2021

> Is there a reason we need to use Kubernetes 1.19? Can we jump to 1.20, or even to the latest 1.21?

Yes, this is flexible. The current repo uses a lower version, so we plan to start with 1.19 and then jump to 1.21 once we merge back to master. That way, if someone uses a lower version, we can have a tag or release for those users.

> Does it mean that we also drop SDK support? Or are we talking only about clientsets, listers, and informers?

Yes, you are right. The Python SDK will be supported; I meant clientsets. The controller itself uses a higher-level client and doesn't need clientsets. By the way, does Katib use them?

@andreyvelich
Member

Sounds good @Jeffwan.

> Yes, you are right. The Python SDK will be supported; I meant clientsets. The controller itself uses a higher-level client and doesn't need clientsets. By the way, does Katib use them?

No, we only use the APIs from TFJob (https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1beta1/experiment/validator/validator.go#L28) to validate TFJob, etc. This can also be dropped on our side, since it's not necessary. cc @kubeflow/wg-automl-leads

@johnugeorge
Member

@Jeffwan Great. Can we merge the code in phases, so that review is easier?

@Jeffwan
Member Author

Jeffwan commented Jul 7, 2021

@johnugeorge Sure. I will cc all training leads on PRs coming into the feature branch.
