Upgrades in 1.1 should follow kustomize off the shelf workflow #304
Comments
As noted in kubeflow/kubeflow#4873, kustomize commonLabels should only be used for immutable labels, because commonLabels are substituted into selectors and selectors are immutable. Right now our applications include the version in the version and instance labels, which are used in selectors and set via commonLabels. We need to fix this so that the labels are immutable across version updates. It looks like
So if we have appropriate, immutable labels for each application, then we should be able to use
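A minimal sketch of what this looks like in a kustomization.yaml, assuming hypothetical application and label values (nothing here is copied from the actual Kubeflow manifests):

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml

# Safe for commonLabels: values that never change between Kubeflow releases,
# so the selectors kustomize derives from them remain valid on upgrade.
commonLabels:
  app.kubernetes.io/name: jupyter-web-app
  app.kubernetes.io/component: jupyter

# Mutable values such as app.kubernetes.io/version, or an instance name that
# embeds the version, should be set on the resources directly (e.g. via a
# patch or the Application resource), not via commonLabels, because
# commonLabels is also injected into immutable selector fields.
```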
* Fix kubeflow#1131
* kustomize commonLabels get substituted into selector fields. Selector fields are immutable, so if commonLabels change (e.g. between versions) then we can't reapply/update the existing resources, which breaks upgrades (kubeflow/kfctl#304).
* For the most part the problematic commonLabels were on our Application resources. The following labels were being set: "app.kubernetes.io/version", "app.kubernetes.io/instance", "app.kubernetes.io/managed-by", "app.kubernetes.io/part-of".
* version was definitely changing between versions; instance was also changing between versions to include the version number.
* managed-by and part-of could also change (e.g. we may not be using kfctl).
* We could still set these labels if we wanted to; we just shouldn't set them as commonLabels and/or include them in the selector, as they will inhibit upgrades with kubectl apply.
* I created a test, validate_resources_test.go, to ensure none of these labels are included in commonLabels.
* I created a simple go binary, tools/fix_common_labels.go, to update all the resources.
* generat_tests.py - deleted the code that removes unmatched tests.
  * We no longer generate tests that way, and the delete code was going to delete valid tests like our new validation test.
* Got rid of the clean rule in the Makefile for the same reason.
@jlewi Hey Jeremy - will this feature be included in Kubeflow 1.1?
* This is GCP-specific code that allows CloudEndpoints to be created using the CloudEndpoint controller. A Cloud Endpoint is a KRM-style resource, so we can just have `kfctl apply -f {path}` invoke the appropriate logic.
* For GCP this addresses GoogleCloudPlatform/kubeflow-distribution#36; specifically, when deploying private GKE the CloudEndpoints controller won't be able to contact the servicemanagement API. This provides a workaround by running it locally.
* This pattern seems extensible; i.e. other platforms could link in code to handle CRs specific to their platforms. This could basically be an alternative to plugins.
* I added a context flag to control the kubecontext that apply applies to. Unfortunately, it doesn't look like there is an easy way to use that in the context of applying a KFDef. It looks like the current logic assumes the cluster will be added to the KFDef metadata and then looks up that cluster in .kubeconfig.
  * Modifying that logic to support the context flag seemed riskier than simply adding a comment to the flag.
* Added some warnings that KFUpgrade is deprecated since, per kubeflow#304, we want to follow the off-the-shelf workflow.
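A rough sketch of how this might be invoked from the command line; the file name is a placeholder and the exact spelling of the context flag is an assumption (the PR itself is authoritative):

```bash
# Illustrative only: apply a KRM-style CloudEndpoint resource with kfctl so the
# linked-in controller logic runs locally, which helps when a private GKE
# cluster cannot reach the servicemanagement API from inside the cluster.
# The file name and context flag spelling below are assumptions, not from the PR.
kfctl apply -f cloud-endpoint.yaml --context my-private-gke-context
```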
Per Yuan, I deleted:
* Process and tools for upgrades from Release N-1 to N, i.e. 1.0.x to 1.1 ([kubeflow#304](kubeflow/kfctl#304))

Per James, I added:
* Manage recurring Runs via new “Jobs” page (exact name on UI is TBD)
* Update ROADMAP.md: Updated Kubeflow 1.1 and added Kubeflow 1.2 and Kubeflow 1.3 roadmap items.
* Update ROADMAP.md: Improved wording of features to simplify understanding.
* Update ROADMAP.md: Added details on KFServing 0.5 enhancements.
* Update ROADMAP.md: Updated the notebooks section in Kubeflow 1.3 with these modifications:
  * Notebooks
    * Important backend updates to Notebooks (i.e. to improve interop with Tensorboard)
    * New and expanded Jupyter Notebook stack along with easy-to-customize common base images
    * Addition of R-Studio and Code-Server (VS-Code) support
* Update ROADMAP.md: Reorganized Working Group updates into the 1st section; added that customizing the Jupyter base image is a stretch feature.
* Update ROADMAP.md: Per Yuan, deleted "Process and tools for upgrades from Release N-1 to N, i.e. 1.0.x to 1.1" ([#304](kubeflow/kfctl#304)). Per James, added "Manage recurring Runs via new “Jobs” page (exact name on UI is TBD)".
* Update ROADMAP.md: Added Multi-Model Serving (https://github.com/yuzliu/kfserving/blob/master/docs/MULTIMODELSERVING_GUIDE.md) to the KFServing 0.5 roadmap items.
Filing this issue to track simplifying the upgrade process in Kubeflow 1.1.
Here are the current instructions for how Kubeflow upgrades are done:
https://www.kubeflow.org/docs/upgrading/upgrade/
This differs from the standard off-the-shelf workflow for kustomize applications:
https://github.com/kubernetes-sigs/kustomize/blob/master/docs/workflows.md#off-the-shelf-configuration
In particular, we introduce a KFUpgrade resource which defines pointers to the old and new KFDef.
https://www.kubeflow.org/docs/upgrading/upgrade/#upgrade-instructions
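For context, the KFUpgrade resource described in those instructions is roughly of the following shape, with pointers to the current and new KFDef; the apiVersion and field names here are illustrative and the linked docs are authoritative:

```yaml
apiVersion: kfupgrade.apps.kubeflow.org/v1alpha1   # illustrative; check the docs
kind: KfUpgrade
metadata:
  name: kf-upgrade-v1.0.1
spec:
  currentKfDef:      # pointer to the deployed (old) KFDef
    name: kubeflow
    version: v1.0
  newKfDef:          # pointer to the target (new) KFDef
    name: kubeflow
    version: v1.0.1
```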
kfctl then does a lot of magic in order to try to reapply any user-defined kustomizations on top of the new configs.
With the new kustomize patterns (http://bit.ly/kf_kustomize_v3) we should be able to simplify this and, I think, eliminate the need for kfctl. Instead, users should be able to just follow the standard off-the-shelf kustomize workflow (see the sketch below).
This is because the new pattern with stacks has kfctl generate a new kustomize package that uses the Kubeflow-defined packages in .cache as the base, so a user can regenerate .cache without losing any of their kustomizations.
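Concretely, the off-the-shelf flow could look roughly like the sketch below; the paths are placeholders and the exact kfctl invocation for refreshing .cache depends on the config being used:

```bash
# Illustrative sketch of the off-the-shelf upgrade flow (paths are placeholders).
#
# 1. Refresh the Kubeflow-defined base packages in .cache for the new release,
#    e.g. by re-running `kfctl build` against the updated KFDef config; the
#    user's own kustomizations live in their overlay, so they are preserved.
#
# 2. Build and apply the overlay with stock tooling:
kustomize build ./kustomize | kubectl apply -f -
```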
There are a couple of issues that we run into when applying the updated manifests; for example, resources that are removed in the new version are not pruned when `apply` is called. Rather than rely on kfctl logic to solve these problems, we should follow a shift-left pattern: our expectation should be that we rely on existing tools (e.g. kubectl, kpt, etc.) to apply the manifests and handle these problems.
kpt, for example, supports pruning.
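As a hedged sketch of how that could look with kpt's `live` workflow (the directory name is a placeholder; flags and behavior vary by kpt version):

```bash
# Illustrative: kpt's live workflow records applied resources in an inventory
# object, so resources dropped from the new manifests can be pruned on the
# next apply. Exact flags/behavior depend on the kpt version.
kpt live init ./manifests    # one-time: create the inventory template
kpt live apply ./manifests   # apply the new manifests and prune removed ones
```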
/cc @richardsliu @yanniszark @kunmingg