Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revamp how Tekton pipelines to run notebooks work. #703

Merged
merged 6 commits into from
Jun 29, 2020

Conversation

jlewi
Copy link
Contributor

@jlewi jlewi commented Jun 24, 2020

Notebook tests should build a docker image to run the notebook in.

  • Tekton tasks and pipelines to run notebooks and generate reports #613 currently the way we run notebook tests is
    by firing off a K8s job on the KF cluster which runs the notebook.

    • The K8s job uses init containers to pull in source code and install
      dependencies like papermill.

    • This is a bit brittle.

  • To fix this we will instead use Tekton to build a docker image that
    takes the notebook image and then adds the notebook code to it.

    • Dockerfile.notebook_runner dockerfile to build the test image.

The pipeline to run the notebook consists of two tasks

  1. A Tekton Task to build a docker image to run the notebook in

  2. A tekton task that fires off a K8s job to run the notebook on the Kubeflow cluster.

Here's a list of changes to make this work

  • tekton_client should provide methods to upload artifacts but not parse
    junits

  • Add a tekton_client method to construct the full image URL based on
    the digest returned from kaniko

  • Copy over the code for running the notebook tests from kubeflow/examples
    and start modifying it.

  • Create a simple CLI to wait for nomos to sync resources to the cluster

    • This is used in some syntactic sugar make rules to aid the dev-test loop

The mnist test isn't completing successfully yet because GoogleCloudPlatform/kubeflow-distribution#61 means the KF
deployments don't have proper GSA's to write to GCS.

Related to: #613

Notebook tests should build a docker image to run the notebook in.

* kubeflow#613 currently the way we run notebook tests is
  by firing off a K8s job on the KF cluster which runs the notebook.

  * The K8s job uses init containers to pull in source code and install
    dependencies like papermill.

  * This is a bit brittle.

* To fix this we will instead use Tekton to build a docker image that
  takes the notebook image and then adds the notebook code to it.

  * Dockerfile.notebook_runner dockerfile to build the test image.

The pipeline to run the notebook consists of two tasks

  1. A Tekton Task to build a docker image to run the notebook in

  1. A tekton task that fires off a K8s job to run the notebook on the Kubeflow cluster.

Here's a list of changes to make this work

* tekton_client should provide methods to upload artifacts but not parse
  junits

* Add a tekton_client method to construct the full image URL based on
  the digest returned from kaniko

* Copy over the code for running the notebook tests from kubeflow/examples
and start modifying it.

* Create a simple CLI to wait for nomos to sync resources to the cluster
  * This is used in some syntactic sugar make rules to aid the dev-test loop

The mnist test isn't completing successfully yet because GoogleCloudPlatform/kubeflow-distribution#61 means the KF
deployments don't have proper GSA's to write to GCS.

Related to: kubeflow#613
@kubeflow-bot
Copy link

This change is Reviewable

@jlewi
Copy link
Contributor Author

jlewi commented Jun 24, 2020

/assign @Bobgy
/assign @NikeNano

@jlewi
Copy link
Contributor Author

jlewi commented Jun 24, 2020

/test all

@Bobgy
Copy link
Contributor

Bobgy commented Jun 24, 2020

/lgtm

@jlewi
Copy link
Contributor Author

jlewi commented Jun 24, 2020

/test all

@k8s-ci-robot k8s-ci-robot removed the lgtm label Jun 24, 2020
@Bobgy
Copy link
Contributor

Bobgy commented Jun 24, 2020

/lgtm

@@ -48,7 +48,7 @@ spec:
value: /workspace/kubeconfig
- name: PYTHONPATH
value: /workspace/$(inputs.resources.testing-repo.name)/py
image: gcr.io/kubeflow-ci/test-worker-py3:6f0d932-dirty@sha256:06ebe5412d638e3e51bdd792aecbafdc4ee1e7146ff367a7be346cd726738cbb
image: gcr.io/kubeflow-ci/test-worker-py3:3780b5d-dirty@sha256:4a766d6f5cc6cbcb00dbc96205f7a5b2816bc5f2b6d516fd67124d4a3e6508ea
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does it work with building these images? Are the git sha hard coded or are these auto updated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The images are currently built using skaffold. We use a CLI option with skaffold to emit the URL of the image to a json file. We then use kpt to change the images to point at the new image. There's a make rule to provide syntactic sugar to string these commands together.

Ideally we would automate this so that on postsubmit new images would be automatically built and a PR opened to update all the images.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, thanks for the explanation!

@@ -0,0 +1,62 @@
// nomos-wait is a simple tool to wait until nomos has been sync'd to the current commit.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is noms and what is it used for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nomos is Google's GitOps tooling
https://cloud.google.com/anthos-config-management/docs

We need to wait for the actual K8s resources to be sync'd from the Git repo to the cluster

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Comment on lines +17 to +19
expected := "79629ca7"
if commit != expected {
t.Errorf("Got commit: %v; want %v", commit, expected)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit/question: could EqualErrorf be used instead?

@@ -5,17 +5,10 @@ that we use to run a bunch of our test and release scripts.

## To update the test worker images used in the Tekton tasks

1. Build a new image.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this done now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I follow; its done using the make command listed here which performs the steps I mentioned above and which were listed here in the README.

# The YAML is modified by nb_test_util.py to generate a Job specific
# to a notebook.
#
# TODO(jlewi): We should switch to using Tekton
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand this PR set is up to use Tekton so this should be fixed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite. This is about how we run the notebook on the actual KF cluster. We are currently using a K8s job; we could potentially use a Tekton task if we install Tekton on KF clusters.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had misunderstood it, thanks for the clarification!

"""Get a stack driver link for the job with the specified name."""
logs_filter = f"""resource.type="k8s_container"
labels."k8s-pod/job-name" = "{job_name}"
"""
Copy link
Member

@NikeNano NikeNano Jun 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""
"""

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want to do that. """ is a literal string. So if we indent line 25 we will end up including some extra whitespace in the value which we don't necessarily want.

@@ -15,34 +15,44 @@ spec:
- name: test-target-name
value: manual-testinig
- name: artifacts-gcs
value: gs://kubeflow-ci-deployment/gabrielwen-testing-2
value: gs://kubeflow-ci_temp/jlewi_mnist_testing/2020-0619
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be used as default or should we change, looks like something used during development.

- name: testing-repo
resourceSpec:
type: git
params:
- name: url
value: https://github.com/kubeflow/testing.git
value: https://github.com/jlewi/testing.git
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this point to Kubeflow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

testing-repo shouldn't be a resource anymore.

@jlewi
Copy link
Contributor Author

jlewi commented Jun 26, 2020

/test all

It was the py unittests that failed.

@jlewi
Copy link
Contributor Author

jlewi commented Jun 26, 2020

Tests passed on retry; looks like some flake. I couldn't find any pod logs for the failed run.

resource.type="k8s_container"
resource.labels.pod_name="kubeflow-testing-presubmit-py-unittests-703-90f099a-3953-31b4-3224080807"

So not sure what happened.

@k8s-ci-robot k8s-ci-robot removed the lgtm label Jun 26, 2020
@k8s-ci-robot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

@jlewi
Copy link
Contributor Author

jlewi commented Jun 26, 2020

@NikeNano I addressed all your comments; PTAL.

  from being copied and thus the results from showing up in testgrid
  see kubeflow#703
@NikeNano
Copy link
Member

@NikeNano I addressed all your comments; PTAL.

LGTM

@jlewi
Copy link
Contributor Author

jlewi commented Jun 29, 2020

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jlewi
Copy link
Contributor Author

jlewi commented Jun 29, 2020

@NikeNano you need to use the chatbot command "/lgtm"

@jlewi jlewi added the lgtm label Jun 29, 2020
@k8s-ci-robot k8s-ci-robot merged commit 0f0271a into kubeflow:master Jun 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants