
[Testing] tfx sample pipeline broken in release not caught by postsubmit #5178

Closed
Bobgy opened this issue Feb 24, 2021 · 13 comments

Labels
lifecycle/stale The issue / pull request is stale, any activities remove this label.

Comments

Bobgy commented Feb 24, 2021

In #5137, the tfx sample pipeline was broken, but the breakage was not caught by the postsubmit test.

IIRC our integration test should cover this sample; do we know why this wasn't caught?

Originally posted in #5137 (comment)

I think this is a high-priority issue, because it caused a lot of extra effort after 1.4.0 was released.

/cc @numerology @chensun

Bobgy commented Feb 24, 2021

[Screenshot: postsubmit test run history around the release commit]

Notice that the postsubmit test before the release commit passed, yet the released version had a broken tfx sample pipeline (ignore the failure on the release commit itself; that one was expected).

Integration test of the passing commit: https://oss-prow.knative.dev/view/gs/oss-prow/logs/kubeflow-pipeline-postsubmit-integration-test/1360157617023881216

What's even weirder: after the fix #5165, postsubmit tests started to fail on the parameterized_tfx_sample pipeline.

Bobgy commented Feb 24, 2021

One problem I found: the postsubmit integration test incorrectly uses the presubmit test script: https://github.com/GoogleCloudPlatform/oss-test-infra/blob/f29fc29cd617497ea44164ff6a1734c7dee3c0f4/prow/prowjobs/kubeflow/pipelines/kubeflow-pipelines-postsubmits.yaml#L50

EDIT: this is not the root cause of this problem.

Bobgy commented Feb 24, 2021

I think I found the root cause: an incomplete upgrade of the tfx dependencies.

Let me explain the order of events:

  1. pip started to fail for Python 3.5 images, so we fixed the image build by upgrading to tfx==0.26.0 in #5052 (Upgrade tfx version to 0.26.0 in backend).

    At this point, the built-in sample was already broken, because the tfx SDK version was bumped to 0.26.0 while the tfx image stayed at 0.22.0.
    However, postsubmit tests did not fail, because sample-test/requirements.in was still at tfx==0.22.0.

  2. we released KFP 1.4.0

  3. we got a user report that the built-in tfx sample fails -- as expected given the above

  4. we fixed the built-in sample by updating the tfx image in #5165 (fix(samples): Update tfx image. Fixes #5137).

    However, postsubmit tests then started to fail for the tfx sample, because the tfx image is now at 0.27.0 while sample-test/requirements.in is still at tfx==0.22.0.

Conclusion

  • We should clearly document how to upgrade the TFX dependency in KFP, so that people do not mistakenly upgrade only a subset of its pins.
  • It would be more ideal if the tfx version in these multiple places had a single source of truth, or could be updated programmatically by a script. A quick win is letting sample-test/requirements.in derive from the backend compiler requirements, so the two are always in sync -- the sample-test env then matches the prebuilt-sample compilation env. (See the sketch below.)
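A minimal sketch of that quick win, assuming pip-tools is used to pin versions (the include path is illustrative, not the repo's exact layout):

# sample-test/requirements.in -- hypothetical sketch
# Reuse the backend compiler's top-level pins as the single source of
# truth, so a TFX bump there propagates to the sample tests too.
-r ../backend/requirements.in
# sample-test-only dependencies go below this line.

Regenerating the pinned file is then one command: pip-compile sample-test/requirements.in --output-file sample-test/requirements.txt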

Bobgy commented Feb 25, 2021

The high-priority problems are fixed, and I made sample-test import the same requirements.in as the backend requirements.in.

What's still missing: people may update requirements.in but forget to regenerate all the requirements.txt files.
EDIT: this still seems high priority, and it's an easy mistake to make.
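A hypothetical CI guard against that mistake could recompile each requirements.in with pip-tools and fail when the committed requirements.txt no longer matches. A minimal sketch (the paths and the helper itself are illustrative, not part of the repo):

import shutil
import subprocess
import sys
import tempfile
from pathlib import Path

# (source, pinned) pairs to keep in sync; these paths are illustrative.
PAIRS = [
    (Path("sdk/python/requirements.in"), Path("sdk/python/requirements.txt")),
    (Path("sample-test/requirements.in"), Path("sample-test/requirements.txt")),
]

def pins(text):
    # Ignore comments and blank lines so pip-compile's generated header
    # doesn't produce false positives in the comparison.
    return [line for line in text.splitlines()
            if line.strip() and not line.startswith("#")]

def is_stale(req_in, req_txt):
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "compiled.txt"
        # Seed with the committed pins; pip-compile keeps pins that still
        # satisfy the .in constraints and only changes what it must.
        shutil.copy(req_txt, out)
        subprocess.run(
            ["pip-compile", "--quiet", "--output-file", str(out), str(req_in)],
            check=True,
        )
        return pins(out.read_text()) != pins(req_txt.read_text())

if __name__ == "__main__":
    stale = [str(txt) for src, txt in PAIRS if is_stale(src, txt)]
    if stale:
        sys.exit("Stale pins; re-run pip-compile for: " + ", ".join(stale))

Running something like this in presubmit would turn "forgot to regenerate the pins" into an immediate, attributable failure instead of a latent release break.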

/assign @chensun
Do you think you can take this?

Bobgy commented Feb 25, 2021

And #5187 is pending review.

chensun commented Feb 25, 2021

> The high-priority problems are fixed, and I made sample-test import the same requirements.in as the backend requirements.in.
>
> What's still missing: people may update requirements.in but forget to regenerate all the requirements.txt files.
> EDIT: this still seems high priority, and it's an easy mistake to make.
>
> /assign @chensun
> Do you think you can take this?

Aren't we deprecating requirements.in, given that sdk/python/requirements.txt is covered by #5056?

Bobgy commented Feb 25, 2021

@chensun for clarification, #5056 explicitly disabled renovate for python packages.

You can make the decision to enable it.
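For reference, re-enabling it could be a small renovate.json change, along these lines (a sketch; manager names per Renovate's docs, scoping and schedule omitted):

{
  "pip_requirements": { "enabled": true },
  "pip_setup": { "enabled": true }
}

That would have Renovate open update PRs for requirements files and setup.py respectively.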

chensun commented Feb 25, 2021

> @chensun for clarification, #5056 explicitly disabled renovate for python packages.
>
> You can make the decision to enable it.

I see, thanks for the clarification. I checked again, and it did disable Python. (I was under the wrong impression that sdk/python/requirements.txt was covered, because I saw an ignore list with some components paths and sdk is not in that list.)

chensun self-assigned this Feb 25, 2021
Bobgy commented Feb 25, 2021

> I want to echo again what I said here. I think pip install -r sdk/python/requirements.txt doesn't represent the most common user journey -- think about our notebook samples: they only have pip install kfp or pip install kfp==<pinned version>; I've rarely seen pip install -r sdk/python/requirements.txt.
>
> I would suggest we move away from installing requirements.txt in tests, so that the tests create an environment closer to a fresh pip install kfp. If there's a newly emerged dependency issue, we would probably see it in tests.
>
> P.S.: Taking tfx as an example, their requirements.txt contains nothing but -e . (read from setup.py)
>
> -- @chensun

Moving some of the discussion from the PR thread back to the main issue.

I agree with Chen that testing what users would actually get is important. However, we used to do exactly that before introducing requirements.{in,txt}, and the result was that from time to time presubmit broke without any clues and we had to dig through the dependencies to find out why.

I just want to make sure we are not going in circles; we should admit both approaches have their pros and cons.

I think the discussion laid out in #4682 is not significantly different from what we have here. Maybe the best approach is to keep a requirements.txt, but set up a bot that updates it periodically via PRs. That way, if the update PR fails, we know users might hit the same problem, but it won't block presubmit tests (or other people not working on this problem).
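For concreteness, the refresh step such a bot would run could be as small as this (pip-tools assumed; the PR automation around it is out of scope here):

pip-compile --upgrade sdk/python/requirements.in --output-file sdk/python/requirements.txt

If CI fails on the resulting PR, the breakage came from a new dependency release -- exactly the signal we want, without blocking everyone else's presubmits.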

chensun commented Mar 8, 2021

> However, we used to do exactly that before introducing requirements.{in,txt}, and the result was that from time to time presubmit broke without any clues and we had to dig through the dependencies to find out why.

If I'm not mistaken, this is usually due to new versions of dependencies being incompatible with other existing dependencies. That's a sign we need to fix setup.py by adding upper-bound restrictions on the dependencies. Using requirements.txt in tests doesn't solve the underlying issue; it hides it.

The fact that many of the dependencies listed in the kfp setup.py don't have an upper version limit is problematic, IMHO:

REQUIRES = [
'absl-py>=0.9,<=0.11',
'PyYAML>=5.3',
'google-cloud-storage>=1.13.0',
'kubernetes>=8.0.0, <12.0.0',
'google-auth>=1.6.1',
'requests_toolbelt>=0.8.0',
'cloudpickle',
# Update the upper version whenever a new major version of the
# kfp-server-api package is released.
# Update the lower version when kfp sdk depends on new apis/fields in
# kfp-server-api.
# Note, please also update ./requirements.in
'kfp-server-api>=1.1.2, <2.0.0',
'jsonschema >= 3.0.1',
'tabulate',
'click',
'Deprecated',
'strip-hints',
'docstring-parser>=0.7.3',
'kfp-pipeline-spec>=0.1.5, <0.2.0',
'fire>=0.3.1',
'protobuf'
]

So I think one action item is to add upper limits regardless of whether we use requirements.txt in tests. WDYT?

EDIT: created #5258
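For illustration only (the concrete bounds were left to #5258; these are not the ones that landed), capping each dependency below its next major version would look like:

REQUIRES = [
    'absl-py>=0.9,<=0.11',
    'PyYAML>=5.3,<6',                   # upper bound added
    'google-cloud-storage>=1.13.0,<2',  # upper bound added
    'kubernetes>=8.0.0,<12.0.0',
    'google-auth>=1.6.1,<2',            # upper bound added
    'requests_toolbelt>=0.8.0,<1',      # upper bound added
    'cloudpickle>=1.3.0,<2',            # lower and upper bounds added
    # ... remaining entries bounded the same way ...
]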

Bobgy commented Apr 16, 2021

TODO: update TFX upgrade documentation

stale bot commented Jul 20, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the lifecycle/stale label Jul 20, 2021
stale bot commented Mar 3, 2022

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

stale bot closed this as completed Mar 3, 2022