Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure ImportJobTest is not flaky by checking WriteToStore metric and requesting adequate resources for testing #332

Merged

Conversation

davidheryanto
Copy link
Collaborator

@davidheryanto davidheryanto commented Nov 27, 2019

feast.ingestion.ImportJobTest sometimes fail unpredictably because not all the keys ingested in Redis are found during retrieval, but the test log does not provide useful info for debugging the failure.

Example of such error:

runPipeline_ShouldWriteToRedisCorrectlyGivenValidSpecAndFeatureRow

  java.lang.AssertionError: 
  Key not found in Redis: ...

This pull request:

  1. Adds a check whether the ingestion job has finished writing elements by checking the metric for no of elements written to the store periodically. Only when this counter no longer changes, the correctness of the ingested FeatureRow is evaluated. Previously the check is based on a fixed amount of duration (which is less reliable due to the the variability in the resources and input samples in the running environment)

The reason for using metric utility for this check is because the pipeline is a streaming pipeline with
unbounded source. Hence, the pipeline is always running or failed, i.e. no completed succesfully
state to check when it has ingested all the Feature Row.

Beam built in metric utility is used here (so it has no external dependencies on external metric collectors, such as statsd or prometheus) and this metric counts the no elements
WriteToStore transform has processed.
https://beam.apache.org/documentation/programming-guide/#metrics

  1. Adds resources request for the Pod running the test. Previously, the resources that the test Pod get assigned depends on how overloaded the test cluster is. By having a guaranteed amount of CPU and memory, the duration of tests should be more predictable.

  2. Adds more debugging info when such test fails:

  • Printing the output of redis INFO. This is useful for e.g. to check the count of keys ingested. If the count is fewer than expected, probably, we need to let ingestion job runs a bit longer because it hasn't fully ingested the complete sample input data.
  • Printing one random sample of element in Redis, to check that at least some FeatureRow is correctly ingested.

These should ensure ImportJobTest test result is reproducible.

Also print random Redis element to debug that some FeatureRow has been ingested properly
Some tests (like ingestion test) expect the operation to complete in certain amount of time. This can only be guaranteed if the process have adequate CPU and memory.
Without it, when the test cluster is overloaded, the test process may get little CPU time allocated and the expected completion time is no longer valid
@feast-ci-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: davidheryanto

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@woop
Copy link
Member

woop commented Nov 27, 2019

Thanks for this @davidheryanto

Should we perhaps rename the PR to be more descriptive? Its not clear that this ensures any less flakyness.

@davidheryanto
Copy link
Collaborator Author

/hold
Need to update PR title and consider tracking the ingestion progress.

So we can obtain information about no of elements have been written in the pipeline without resorting to external metrics collector
This method makes use built in metrics util in Apache Beam
@davidheryanto davidheryanto changed the title Ensure ImportJobTest is not flaky Ensure ImportJobTest is not flaky by checking WriteToStore metric and requesting adequate resources for testing Nov 28, 2019
@davidheryanto
Copy link
Collaborator Author

/hold cancel
Updated PR title and add check for WriteToStore metrics

@zhilingc
Copy link
Collaborator

/lgtm

@zhilingc
Copy link
Collaborator

/retest

@feast-ci-bot feast-ci-bot merged commit 60db2ff into feast-dev:master Nov 30, 2019
@feast-ci-bot
Copy link
Collaborator

@davidheryanto: Updated the config configmap in namespace default using the following files:

  • key config.yaml using file .prow/config.yaml

In response to this:

feast.ingestion.ImportJobTest sometimes fail unpredictably because not all the keys ingested in Redis are found during retrieval, but the test log does not provide useful info for debugging the failure.

Example of such error:

runPipeline_ShouldWriteToRedisCorrectlyGivenValidSpecAndFeatureRow

 java.lang.AssertionError: 
 Key not found in Redis: ...

This pull request:

  1. Adds a check whether the ingestion job has finished writing elements by checking the metric for no of elements written to the store periodically. Only when this counter no longer changes, the correctness of the ingested FeatureRow is evaluated. Previously the check is based on a fixed amount of duration (which is less reliable due to the the variability in the resources and input samples in the running environment)

The reason for using metric utility for this check is because the pipeline is a streaming pipeline with
unbounded source. Hence, the pipeline is always running or failed, i.e. no completed succesfully
state to check when it has ingested all the Feature Row.

Beam built in metric utility is used here (so it has no external dependencies on external metric collectors, such as statsd or prometheus) and this metric counts the no elements
WriteToStore transform has processed.
https://beam.apache.org/documentation/programming-guide/#metrics

  1. Adds resources request for the Pod running the test. Previously, the resources that the test Pod get assigned depends on how overloaded the test cluster is. By having a guaranteed amount of CPU and memory, the duration of tests should be more predictable.

  2. Adds more debugging info when such test fails:

  • Printing the output of redis INFO. This is useful for e.g. to check the count of keys ingested. If the count is fewer than expected, probably, we need to let ingestion job runs a bit longer because it hasn't fully ingested the complete sample input data.
  • Printing one random sample of element in Redis, to check that at least some FeatureRow is correctly ingested.

These should ensure ImportJobTest test result is reproducible.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

adchia added a commit that referenced this pull request Nov 8, 2021
* GitBook: [#332] Updating roadmap and adding stream push API docs

* GitBook: [#334] Fix typo in stream ingestion docs and update other references to streaming
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants