-
Notifications
You must be signed in to change notification settings - Fork 8.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Monitoring/Telemetry] Force collectors to indicate when they are ready #36153
[Monitoring/Telemetry] Force collectors to indicate when they are ready #36153
Conversation
💔 Build Failed |
💔 Build Failed |
Pinging @elastic/stack-monitoring |
💔 Build Failed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes for the upgrade assistant collector LGTM
💔 Build Failed |
💚 Build Succeeded |
retest |
💚 Build Succeeded |
retest |
💚 Build Succeeded |
retest |
💚 Build Succeeded |
retest |
💚 Build Succeeded |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spaces changes LGTM!
@crob611 It looks like you approved the PR in a comment, but would you mind going through the official approval phase so the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Canvas looks good
…dy (elastic#36153) * Initial code to force collectors to indicate when they are ready * Add and fix tests * Remove debug * Add ready check in api call * Fix prettier complaints * Return 503 if not all collectors are ready * PR feedback * Add retry logic for usage collection in the reporting tests * Fix incorrect boomify usage * Fix more issues with the tests * Just add debug I guess * More debug * Try and handle this exception * Try and make the tests more defensive and remove console logs * Retry logic here too * Debug for the reporting tests failure * I don't like this, but lets see if it works * Move the retry logic into the collector set directly * Add support for this new collector * Localize this * This shouldn't be static on the class, but rather static for the entire runtime
…dy (#36153) (#36706) * Initial code to force collectors to indicate when they are ready * Add and fix tests * Remove debug * Add ready check in api call * Fix prettier complaints * Return 503 if not all collectors are ready * PR feedback * Add retry logic for usage collection in the reporting tests * Fix incorrect boomify usage * Fix more issues with the tests * Just add debug I guess * More debug * Try and handle this exception * Try and make the tests more defensive and remove console logs * Retry logic here too * Debug for the reporting tests failure * I don't like this, but lets see if it works * Move the retry logic into the collector set directly * Add support for this new collector * Localize this * This shouldn't be static on the class, but rather static for the entire runtime
👍 |
Resolves #35799
Some recent changes in the stack monitoring parity tests and the way we collect usage data within monitoring reintroduced the fact that we currently have a fair amount of nondeterminism in the way our bulk uploader and
/api/stats
endpoint fetch data from collectors. It's true that some collectors, notably the maps, visualizations, and reporting usage collectors as well as the ops stats collector, are not ready to report their data synchronously, but rather, need to wait for some async processes to finish before they are ready.Historically, this has not been a problem because on the next collection interval tick, we'd just try and collect from those usage collectors again and eventually we'd get the data. However, recent changes to our collector code are now ensuring we only collect from usage collectors once a day (it happens with the very first call to collect), instead of at the default monitoring collection interval. Because of this, we ran into a scenario where our parity tests fail because the very first internally collected monitoring document is fairly bare (see #35799) which is the result of certain collectors not being ready to report yet, and the bulk uploader or
/api/stats
endpoint having no idea this is the case.This PR fixes this. It requires that every collector (usage or stats) implements a custom async
isReady()
function that returns true or false. If any known collector is not ready, the bulk uploader will not send its payload to ES, and the/api/stats
endpoint will return a 503. To avoid conflicts with the recent changes with usage collection, if any collector is not ready when we try and collect, we effectively reset the flag to once again try and fetch from usage collectors. This shouldn't affect the performance benefits introduced by #34609 because theisReady()
check will not actually invoke the fetching of usage collectors.It's important to note that it was a conscious decision to introduce extra friction by requiring each collector implement it's own
isReady()
function. Every owner needs to really think about if it needs to implement this function with custom logic, or just return true.For this PR, I have updated each collector and implemented custom logic in the few we identified as needing it, but I also need each owner of the other collectors to weigh in if we need to apply custom logic or not. Hopefully, the team tagging feature in Github will properly identify all owners, but I will do a pass to ensure everyone is notified here.
Testing
The easiest way to test this is to ensure that no
.monitoring-kibana-7-*
documents are lacking the fields noted in #35799, but most of the effort will be isolated to each owners specific usage data and ensuring it's in each and every.monitoring-kibana-7-*
document. This isn't meant to be time consuming, as it should only really affect the few first documents indexed once Kibana starts, but feel free to be as complete as you feel necessary.Questions/Concerns
not ready
state - we should probably have a way out of this situationSuggested Reviewers
To all suggested reviewers, if able, please verify that no special logic is necessary for your collector to be ready to collect the necessary data. Also, ensure that the collector is registered ASAP to avoid any timing issues where the first usage collection doesn't include your usage collector since it wasn't registered yet due to async code happening beforehand (this was true with the reporting code @tsullivan)