Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Monitoring/Telemetry] Force collectors to indicate when they are ready #36153

Merged

Conversation

chrisronline
Copy link
Contributor

@chrisronline chrisronline commented May 7, 2019

Resolves #35799

Some recent changes in the stack monitoring parity tests and the way we collect usage data within monitoring reintroduced the fact that we currently have a fair amount of nondeterminism in the way our bulk uploader and /api/stats endpoint fetch data from collectors. It's true that some collectors, notably the maps, visualizations, and reporting usage collectors as well as the ops stats collector, are not ready to report their data synchronously, but rather, need to wait for some async processes to finish before they are ready.

Historically, this has not been a problem because on the next collection interval tick, we'd just try and collect from those usage collectors again and eventually we'd get the data. However, recent changes to our collector code are now ensuring we only collect from usage collectors once a day (it happens with the very first call to collect), instead of at the default monitoring collection interval. Because of this, we ran into a scenario where our parity tests fail because the very first internally collected monitoring document is fairly bare (see #35799) which is the result of certain collectors not being ready to report yet, and the bulk uploader or /api/stats endpoint having no idea this is the case.

This PR fixes this. It requires that every collector (usage or stats) implements a custom async isReady() function that returns true or false. If any known collector is not ready, the bulk uploader will not send its payload to ES, and the /api/stats endpoint will return a 503. To avoid conflicts with the recent changes with usage collection, if any collector is not ready when we try and collect, we effectively reset the flag to once again try and fetch from usage collectors. This shouldn't affect the performance benefits introduced by #34609 because the isReady() check will not actually invoke the fetching of usage collectors.

It's important to note that it was a conscious decision to introduce extra friction by requiring each collector implement it's own isReady() function. Every owner needs to really think about if it needs to implement this function with custom logic, or just return true.

For this PR, I have updated each collector and implemented custom logic in the few we identified as needing it, but I also need each owner of the other collectors to weigh in if we need to apply custom logic or not. Hopefully, the team tagging feature in Github will properly identify all owners, but I will do a pass to ensure everyone is notified here.

Testing

The easiest way to test this is to ensure that no .monitoring-kibana-7-* documents are lacking the fields noted in #35799, but most of the effort will be isolated to each owners specific usage data and ensuring it's in each and every .monitoring-kibana-7-* document. This isn't meant to be time consuming, as it should only really affect the few first documents indexed once Kibana starts, but feel free to be as complete as you feel necessary.

Questions/Concerns

  • It's theoretically possible that we will no longer reporting any monitoring documents if some collector is stuck in a not ready state - we should probably have a way out of this situation

Suggested Reviewers

To all suggested reviewers, if able, please verify that no special logic is necessary for your collector to be ready to collect the necessary data. Also, ensure that the collector is registered ASAP to avoid any timing issues where the first usage collection doesn't include your usage collector since it wasn't registered yet due to async code happening beforehand (this was true with the reporting code @tsullivan)

@elasticmachine
Copy link
Contributor

💔 Build Failed

@elasticmachine
Copy link
Contributor

💔 Build Failed

@chrisronline chrisronline marked this pull request as ready for review May 7, 2019 15:30
@chrisronline chrisronline requested review from a team as code owners May 7, 2019 15:30
@chrisronline chrisronline self-assigned this May 7, 2019
@elasticmachine
Copy link
Contributor

Pinging @elastic/stack-monitoring

@elasticmachine
Copy link
Contributor

💔 Build Failed

Copy link
Member

@mistic mistic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes for the upgrade assistant collector LGTM

@elasticmachine
Copy link
Contributor

💔 Build Failed

@elasticmachine
Copy link
Contributor

💚 Build Succeeded

@chrisronline
Copy link
Contributor Author

retest

@elasticmachine
Copy link
Contributor

💚 Build Succeeded

@chrisronline
Copy link
Contributor Author

retest

@elasticmachine
Copy link
Contributor

💚 Build Succeeded

@chrisronline
Copy link
Contributor Author

retest

@elasticmachine
Copy link
Contributor

💚 Build Succeeded

@chrisronline
Copy link
Contributor Author

retest

@elasticmachine
Copy link
Contributor

💚 Build Succeeded

Copy link
Member

@legrego legrego left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spaces changes LGTM!

@chrisronline
Copy link
Contributor Author

@crob611 It looks like you approved the PR in a comment, but would you mind going through the official approval phase so the kibana-canvas requirement is satisfied? Thanks!

Copy link
Contributor

@crob611 crob611 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Canvas looks good

@chrisronline chrisronline merged commit c87e881 into elastic:master May 20, 2019
@chrisronline chrisronline deleted the monitoring/collector_is_ready branch May 20, 2019 17:04
chrisronline added a commit to chrisronline/kibana that referenced this pull request May 20, 2019
…dy (elastic#36153)

* Initial code to force collectors to indicate when they are ready

* Add and fix tests

* Remove debug

* Add ready check in api call

* Fix prettier complaints

* Return 503 if not all collectors are ready

* PR feedback

* Add retry logic for  usage collection in the reporting tests

* Fix incorrect boomify usage

* Fix more issues with the tests

* Just add debug I guess

* More debug

* Try and handle this exception

* Try and make the tests more defensive and remove console logs

* Retry logic here too

* Debug for the reporting tests failure

* I don't like this, but lets see if it works

* Move the retry logic into the collector set directly

* Add support for this new collector

* Localize this

* This shouldn't be static on the class, but rather static for the entire runtime
chrisronline added a commit that referenced this pull request May 20, 2019
…dy (#36153) (#36706)

* Initial code to force collectors to indicate when they are ready

* Add and fix tests

* Remove debug

* Add ready check in api call

* Fix prettier complaints

* Return 503 if not all collectors are ready

* PR feedback

* Add retry logic for  usage collection in the reporting tests

* Fix incorrect boomify usage

* Fix more issues with the tests

* Just add debug I guess

* More debug

* Try and handle this exception

* Try and make the tests more defensive and remove console logs

* Retry logic here too

* Debug for the reporting tests failure

* I don't like this, but lets see if it works

* Move the retry logic into the collector set directly

* Add support for this new collector

* Localize this

* This shouldn't be static on the class, but rather static for the entire runtime
@chrisronline
Copy link
Contributor Author

chrisronline commented May 20, 2019

Backport:

7.x: 2ac88ca
6.8: 12bbb5e

@tsullivan
Copy link
Member

It's important to note that it was a conscious decision to introduce extra friction by requiring each collector implement it's own isReady() function. Every owner needs to really think about if it needs to implement this function with custom logic, or just return true.

👍

chrisronline added a commit that referenced this pull request Jul 17, 2019
…re ready (#36153) (#41289)

* Backport c87e881 to 6.8

* Fix tests

* Add in missing functionality

* Add more missing code
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Metricbeat-indexed kibana_stats docs differ from internally-indexed ones