Ensure ImportJobTest is not flaky by checking WriteToStore metric and requesting adequate resources for testing #332

davidheryanto · 2019-11-27T07:13:49Z

feast.ingestion.ImportJobTest sometimes fails unpredictably because not all the keys ingested into Redis are found during retrieval, and the test log does not provide useful information for debugging the failure.

Example of such error:

runPipeline_ShouldWriteToRedisCorrectlyGivenValidSpecAndFeatureRow

  java.lang.AssertionError: 
  Key not found in Redis: ...

This pull request:

Adds a check for whether the ingestion job has finished writing elements, by periodically polling the metric that counts the number of elements written to the store. The correctness of the ingested FeatureRows is evaluated only once this counter stops changing. Previously the check waited for a fixed duration, which is less reliable because resources and input samples vary across running environments.

A metric is used for this check because the pipeline is a streaming pipeline with an unbounded source. Hence the pipeline is always either running or failed; there is no "completed successfully" state to check for once it has ingested all the FeatureRows.

Beam's built-in metrics utility is used here, so there is no dependency on external metric collectors such as statsd or Prometheus. The metric counts the number of elements the WriteToStore transform has processed.
https://beam.apache.org/documentation/programming-guide/#metrics
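For illustration, the "wait until the counter stops changing" idea can be sketched in plain Java. This is not the code in this PR: the class name `StableCounterWait`, the `awaitStable` helper, and all of its parameters are hypothetical, and the `LongSupplier` stands in for whatever reads the WriteToStore element count from Beam's metric results. Requiring a non-zero value before treating the counter as stable is an extra assumption made here, so that a pipeline that has not started writing is not mistaken for a finished one.

```java
import java.time.Duration;
import java.util.function.LongSupplier;

class StableCounterWait {

  /**
   * Polls the counter until it reports the same non-zero value for
   * requiredStablePolls consecutive polls, then returns that value.
   * Throws IllegalStateException if the timeout elapses first.
   */
  static long awaitStable(LongSupplier counter, Duration pollInterval,
                          int requiredStablePolls, Duration timeout) {
    long deadline = System.nanoTime() + timeout.toNanos();
    long last = counter.getAsLong();
    int stablePolls = 0;
    while (stablePolls < requiredStablePolls) {
      if (System.nanoTime() > deadline) {
        throw new IllegalStateException(
            "Counter not stable after " + timeout + ", last value " + last);
      }
      try {
        Thread.sleep(pollInterval.toMillis());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("Interrupted while waiting", e);
      }
      long current = counter.getAsLong();
      // Assumption: a poll is "stable" only if something has been written
      // and the value did not change since the previous poll.
      stablePolls = (current == last && current > 0) ? stablePolls + 1 : 0;
      last = current;
    }
    return last;
  }

  public static void main(String[] args) {
    // Simulated writer: the count grows by 10 per read until it reaches 50.
    long[] written = {0};
    LongSupplier fakeMetric = () -> written[0] = Math.min(50, written[0] + 10);
    long total = awaitStable(fakeMetric, Duration.ofMillis(5), 3, Duration.ofSeconds(5));
    System.out.println("elements written: " + total);
  }
}
```

In a real test the supplier would query the running pipeline's metrics rather than a simulated array, and the poll interval and stability threshold would need tuning for the test environment.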

Adds a resources request for the Pod running the test. Previously, the resources assigned to the test Pod depended on how overloaded the test cluster was. With a guaranteed amount of CPU and memory, test duration should be more predictable.

Adds more debugging info when such a test fails:

Prints the output of Redis INFO. This is useful, for example, to check the count of keys ingested. If the count is lower than expected, the ingestion job probably needs to run a bit longer because it has not yet ingested the complete sample input data.
Prints one random sample element from Redis, to check that at least some FeatureRows are correctly ingested.

These changes should make the ImportJobTest result reproducible.
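As a rough illustration of how the Redis INFO output helps here, the sketch below extracts the key count from the keyspace line (of the form `db0:keys=42,expires=0,avg_ttl=0`) and compares it with an expected count. It is not taken from this PR: `RedisInfoKeys`, `keyCount`, and `describe` are hypothetical names, the INFO text is passed in as a plain string rather than fetched from a client, and a missing `dbN` line is treated as zero keys.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RedisInfoKeys {

  // Keyspace lines in INFO output look like: db0:keys=42,expires=0,avg_ttl=0
  private static final Pattern KEYSPACE_LINE =
      Pattern.compile("^db(\\d+):keys=(\\d+),.*$");

  /** Returns the key count reported for the given database index, if listed. */
  static OptionalLong keyCount(String info, int db) {
    for (String line : info.split("\\r?\\n")) {
      Matcher m = KEYSPACE_LINE.matcher(line.trim());
      if (m.matches() && Integer.parseInt(m.group(1)) == db) {
        return OptionalLong.of(Long.parseLong(m.group(2)));
      }
    }
    return OptionalLong.empty();
  }

  /** Builds a debugging message comparing the actual and expected key counts. */
  static String describe(String info, int db, long expectedKeys) {
    // Assumption: a database with no keyspace line holds zero keys.
    long actual = keyCount(info, db).orElse(0L);
    if (actual < expectedKeys) {
      return "Redis db" + db + " has " + actual + " of " + expectedKeys
          + " expected keys; ingestion may not have finished";
    }
    return "Redis db" + db + " has " + actual + " keys (expected " + expectedKeys + ")";
  }

  public static void main(String[] args) {
    String sampleInfo = "# Keyspace\r\ndb0:keys=7,expires=0,avg_ttl=0\r\n";
    System.out.println(describe(sampleInfo, 0, 10));
  }
}
```

A message like this in the assertion failure makes it easier to tell "ingestion was cut short" apart from "rows were written under the wrong keys".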


Some tests (like the ingestion test) expect an operation to complete within a certain amount of time. This can only be guaranteed if the process has adequate CPU and memory. Without that, when the test cluster is overloaded, the test process may be allocated little CPU time and the expected completion time no longer holds.

feast-ci-bot · 2019-11-27T07:13:53Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: davidheryanto

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [davidheryanto]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

woop · 2019-11-27T07:52:19Z

Thanks for this @davidheryanto

Should we perhaps rename the PR to be more descriptive? It's not clear that this ensures any less flakiness.

davidheryanto · 2019-11-27T08:24:19Z

/hold
Need to update PR title and consider tracking the ingestion progress.


davidheryanto · 2019-11-28T08:00:01Z

/hold cancel
Updated the PR title and added a check for WriteToStore metrics

zhilingc · 2019-11-28T08:58:59Z

/lgtm

zhilingc · 2019-11-29T08:24:31Z

/retest

feast-ci-bot · 2019-11-30T19:17:48Z

@davidheryanto: Updated the config configmap in namespace default using the following files:

key config.yaml using file .prow/config.yaml



davidheryanto added 2 commits November 27, 2019 14:48

Print Redis INFO when key not found in ImportJobTest

b1bbbda

Also print random Redis element to debug that some FeatureRow has been ingested properly

davidheryanto requested review from pradithya, thirteen37, tims, woop and zhilingc as code owners November 27, 2019 07:13

feast-ci-bot added approved size/M labels Nov 27, 2019

feast-ci-bot added the do-not-merge/hold label Nov 27, 2019

davidheryanto added 2 commits November 28, 2019 14:54

Add metric WriteToStore:elements_written

36ac0f9

So we can obtain the number of elements written in the pipeline without resorting to an external metrics collector. This method makes use of the built-in metrics util in Apache Beam.

Add a check for all elements to be written to store before checking the ingestion result in ImportJobTest

55fc69e

feast-ci-bot added size/L and removed size/M labels Nov 28, 2019

Merge branch 'master' into import-job-test-error-logging

eb83163

davidheryanto changed the title ~~Ensure ImportJobTest is not flaky~~ Ensure ImportJobTest is not flaky by checking WriteToStore metric and requesting adequate resources for testing Nov 28, 2019

Merge branch 'master' into import-job-test-error-logging

d17a28d

feast-ci-bot removed the do-not-merge/hold label Nov 28, 2019

feast-ci-bot assigned zhilingc Nov 28, 2019

feast-ci-bot added the lgtm label Nov 28, 2019

feast-ci-bot merged commit 60db2ff into feast-dev:master Nov 30, 2019

gitbook-com bot pushed a commit that referenced this pull request Nov 6, 2021

GitBook: [#332] Updating roadmap and adding stream push API docs

d8bd8cf

adchia added a commit that referenced this pull request Nov 8, 2021

Adding stream ingestion alpha documentation (#2005)

18615f7

* GitBook: [#332] Updating roadmap and adding stream push API docs * GitBook: [#334] Fix typo in stream ingestion docs and update other references to streaming
