Revamp how Tekton pipelines to run notebooks work. #703

jlewi · 2020-06-24T00:09:22Z

Notebook tests should build a docker image to run the notebook in.

See "Tekton tasks and pipelines to run notebooks and generate reports" (#613). Currently, we run notebook tests
by firing off a K8s job on the KF cluster which runs the notebook.
- The K8s job uses init containers to pull in source code and install
  dependencies like papermill.
- This is a bit brittle.

To fix this, we will instead use Tekton to build a docker image that
takes the notebook image and adds the notebook code to it.
- Dockerfile.notebook_runner is the dockerfile used to build the test image.

The pipeline to run the notebook consists of two tasks:

1. A Tekton task to build a docker image to run the notebook in.
2. A Tekton task that fires off a K8s job to run the notebook on the Kubeflow cluster.

Here's a list of changes to make this work:

- tekton_client should provide methods to upload artifacts but not parse
  junits.
- Add a tekton_client method to construct the full image URL based on
  the digest returned from kaniko.
- Copy over the code for running the notebook tests from kubeflow/examples
  and start modifying it.
- Create a simple CLI to wait for nomos to sync resources to the cluster.
  - This is used in some syntactic sugar make rules to aid the dev-test loop.
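
As an illustration of the image URL item in the list above, here is a minimal sketch of pinning an image reference to a digest. The function name `build_image_url` and the digest validation are assumptions made for this example; this is not the method actually added to tekton_client. It assumes the digest arrives as a `sha256:<hex>` string, possibly with trailing whitespace.

```python
import re

# A sha256 digest as it typically appears in image references.
_DIGEST_RE = re.compile(r"^sha256:[0-9a-f]{64}$")


def build_image_url(image, digest):
    """Return `image` pinned to `digest`, e.g. gcr.io/proj/name:tag@sha256:...

    Hypothetical helper for illustration only.
    """
    digest = digest.strip()
    if not _DIGEST_RE.match(digest):
        raise ValueError("unexpected digest format: {0!r}".format(digest))
    # Drop any digest already attached to the image reference; keep the tag.
    base = image.split("@", 1)[0]
    return "{0}@{1}".format(base, digest)
```

Keeping the tag alongside the digest matches the `repo:tag@sha256:...` form used by the image references in this PR's task definitions.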

The mnist test isn't completing successfully yet because GoogleCloudPlatform/kubeflow-distribution#61 means the KF
deployments don't have proper GSAs to write to GCS.

Related to: #613

kubeflow-bot · 2020-06-24T00:09:28Z

This change is

jlewi · 2020-06-24T00:09:41Z

/assign @Bobgy
/assign @NikeNano

jlewi · 2020-06-24T00:21:09Z

/test all

Bobgy · 2020-06-24T02:29:04Z

/lgtm

jlewi · 2020-06-24T03:02:57Z

/test all

Bobgy · 2020-06-24T04:30:56Z

/lgtm

NikeNano · 2020-06-24T05:38:30Z

acm-repos/kf-ci-v1/namespaces/auto-deploy/tekton.dev_v1alpha1_task_cleanup-kubeflow-ci.yaml

@@ -48,7 +48,7 @@ spec:
      value: /workspace/kubeconfig
    - name: PYTHONPATH
      value: /workspace/$(inputs.resources.testing-repo.name)/py
-    image: gcr.io/kubeflow-ci/test-worker-py3:6f0d932-dirty@sha256:06ebe5412d638e3e51bdd792aecbafdc4ee1e7146ff367a7be346cd726738cbb
+    image: gcr.io/kubeflow-ci/test-worker-py3:3780b5d-dirty@sha256:4a766d6f5cc6cbcb00dbc96205f7a5b2816bc5f2b6d516fd67124d4a3e6508ea


How are these images built? Are the git SHAs hard-coded, or are they updated automatically?

The images are currently built using skaffold. We use a CLI option with skaffold to emit the URL of the image to a json file. We then use kpt to change the images to point at the new image. There's a make rule to provide syntactic sugar to string these commands together.

Ideally we would automate this so that on postsubmit new images would be automatically built and a PR opened to update all the images.

Cool, thanks for the explanation!
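
The update flow described in this thread (emit the built image URL to a JSON file, then rewrite the image references) could be sketched roughly as follows. This is an illustration, not the repo's make rule: the function names are hypothetical, the JSON is assumed to contain a `builds` list whose entries carry `imageName` and `tag`, and a plain regex rewrite stands in for the kpt step.

```python
import json
import re


def load_built_images(skaffold_json):
    """Map image name -> fully qualified tag from the build output JSON.

    Assumes the shape {"builds": [{"imageName": ..., "tag": ...}]}.
    """
    data = json.loads(skaffold_json)
    return {b["imageName"]: b["tag"] for b in data.get("builds", [])}


def update_image_refs(manifest_text, images):
    """Point every `image: <name>...` reference at the newly built tag.

    A regex stand-in for the kpt setter step; hypothetical helper.
    """
    for name, tag in images.items():
        # Match the image name plus any existing :tag and/or @digest suffix.
        pattern = re.compile(
            r"(image:\s*)" + re.escape(name) + r"(?:[:@]\S*)?(?=\s|$)")
        manifest_text = pattern.sub(
            lambda m, tag=tag: m.group(1) + tag, manifest_text)
    return manifest_text
```

The lookahead at the end of the pattern keeps an image name from matching when it is only a prefix of a longer name.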

NikeNano · 2020-06-24T05:43:51Z

go/cmd/nomos-wait/main.go

@@ -0,0 +1,62 @@
+// nomos-wait is a simple tool to wait until nomos has been sync'd to the current commit.


What is nomos and what is it used for?

nomos is Google's GitOps tooling
https://cloud.google.com/anthos-config-management/docs

We need to wait for the actual K8s resources to be sync'd from the Git repo to the cluster
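
The waiting described above amounts to a polling loop. The sketch below shows the idea in Python; it is not the Go implementation in go/cmd/nomos-wait. The `get_synced_commit` callable is a placeholder for however the cluster's currently synced commit is looked up, and the timeout and poll defaults are arbitrary assumptions.

```python
import time


def wait_for_sync(get_synced_commit, want_commit, timeout_s=300, poll_s=5,
                  sleep=time.sleep, now=time.time):
    """Poll until the synced commit matches `want_commit`, or time out.

    `get_synced_commit` is a hypothetical callable returning the commit
    the cluster has synced to (or an empty string if unknown).
    """
    deadline = now() + timeout_s
    while True:
        synced = get_synced_commit()
        # Compare by prefix so short and full SHAs can be mixed.
        if synced and (synced.startswith(want_commit)
                       or want_commit.startswith(synced)):
            return synced
        if now() >= deadline:
            raise TimeoutError(
                "cluster not synced to {0}; last seen {1!r}".format(
                    want_commit, synced))
        sleep(poll_s)
```

Injecting `sleep` and `now` keeps the loop testable without real delays.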

NikeNano · 2020-06-24T05:49:07Z

go/cmd/nomos-wait/main_test.go

+	expected := "79629ca7"
+	if commit != expected {
+		t.Errorf("Got commit: %v; want %v", commit, expected)


nit/question: could EqualErrorf be used instead?

NikeNano · 2020-06-24T05:49:48Z

images/README.md

@@ -5,17 +5,10 @@ that we use to run a bunch of our test and release scripts.

 ## To update the test worker images used in the Tekton tasks

-1. Build a new image.


How is this done now?

Not sure I follow; it's done using the make command listed here, which performs the steps I mentioned above and which were listed here in the README.

NikeNano · 2020-06-24T05:52:36Z

py/kubeflow/testing/notebook_tests/job.yaml

+# The YAML is modified by nb_test_util.py to generate a Job specific
+# to a notebook.
+#
+# TODO(jlewi): We should switch to using Tekton


As I understand it, this PR sets things up to use Tekton, so shouldn't this TODO already be resolved?

Not quite. This is about how we run the notebook on the actual KF cluster. We are currently using a K8s job; we could potentially use a Tekton task if we install Tekton on KF clusters.

I had misunderstood it, thanks for the clarification!

NikeNano · 2020-06-24T05:52:55Z

py/kubeflow/testing/notebook_tests/nb_test_util.py

+  """Get a stack driver link for the job with the specified name."""
+  logs_filter = f"""resource.type="k8s_container"
+   labels."k8s-pod/job-name" = "{job_name}"
+"""


Suggested change

"""

"""

I don't think we want to do that. """ is a literal string. So if we indent line 25 we will end up including some extra whitespace in the value which we don't necessarily want.
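
The whitespace point above can be checked directly. In the sketch below, the function names are made up for the example and the filter text is simplified from the diff; the last function shows one possible way to keep the source indented without changing the value, which is an alternative and not what the PR does.

```python
def filter_closing_at_margin(job_name):
    # Closing quotes at the left margin: the value ends with a newline.
    return f"""resource.type="k8s_container"
labels."k8s-pod/job-name" = "{job_name}"
"""


def filter_closing_indented(job_name):
    # Indenting the closing quotes adds those spaces to the value.
    return f"""resource.type="k8s_container"
labels."k8s-pod/job-name" = "{job_name}"
    """


def filter_concatenated(job_name):
    # Adjacent string literals can be indented freely in the source.
    return (
        'resource.type="k8s_container"\n'
        'labels."k8s-pod/job-name" = "{0}"\n'
    ).format(job_name)
```

Only the indented variant carries trailing spaces; the other two produce the same string.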

NikeNano · 2020-06-24T05:55:27Z

tekton/runs/nb-test-run.yaml

@@ -15,34 +15,44 @@ spec:
  - name: test-target-name
    value: manual-testinig
  - name: artifacts-gcs
-    value: gs://kubeflow-ci-deployment/gabrielwen-testing-2
+    value: gs://kubeflow-ci_temp/jlewi_mnist_testing/2020-0619


Should this be the default, or should we change it? It looks like something used during development.

NikeNano · 2020-06-24T05:56:15Z

tekton/runs/nb-test-run.yaml

  - name: testing-repo
    resourceSpec:
      type: git
      params:
      - name: url
-        value: https://github.com/kubeflow/testing.git
+        value: https://github.com/jlewi/testing.git


Should this point to Kubeflow?

testing-repo shouldn't be a resource anymore.

jlewi · 2020-06-26T20:29:03Z

/test all

It was the py unittests that failed.

jlewi · 2020-06-26T20:45:10Z

Tests passed on retry; looks like some flake. I couldn't find any pod logs for the failed run.

resource.type="k8s_container"
resource.labels.pod_name="kubeflow-testing-presubmit-py-unittests-703-90f099a-3953-31b4-3224080807"

So not sure what happened.

k8s-ci-robot · 2020-06-26T20:52:09Z

New changes are detected. LGTM label has been removed.

jlewi · 2020-06-26T21:00:36Z

@NikeNano I addressed all your comments; PTAL.

NikeNano · 2020-06-28T19:59:24Z

> @NikeNano I addressed all your comments; PTAL.

LGTM

jlewi · 2020-06-29T14:55:29Z

/approve

k8s-ci-robot · 2020-06-29T14:55:37Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [jlewi]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

jlewi · 2020-06-29T18:24:53Z

@NikeNano you need to use the chatbot command "/lgtm"

googlebot added the cla: yes label Jun 24, 2020

k8s-ci-robot requested review from Bobgy and rmgogogo June 24, 2020 00:09

k8s-ci-robot added the size/XXL label Jun 24, 2020

k8s-ci-robot assigned Bobgy and NikeNano Jun 24, 2020

tekton_client.py can't use format strings yet because we are still running under python2.

6b96600

Remove f-style strings.

1b4eb83

k8s-ci-robot added the lgtm label Jun 24, 2020

Fix typo.

90f099a

k8s-ci-robot removed the lgtm label Jun 24, 2020

k8s-ci-robot added the lgtm label Jun 24, 2020

NikeNano reviewed Jun 24, 2020

Address PR comments.

068625b

k8s-ci-robot removed the lgtm label Jun 26, 2020

* copy-buckets should not abort on error as this prevents artifacts from being copied and thus the results from showing up in testgrid; see kubeflow#703

2857397

k8s-ci-robot added the approved label Jun 29, 2020

jlewi added the lgtm label Jun 29, 2020

k8s-ci-robot merged commit 0f0271a into kubeflow:master Jun 29, 2020
