[SDK] Fix crash in PeriodicExportingMetricReader #2983
Codecov Report
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2983      +/-   ##
==========================================
+ Coverage   87.12%   87.60%   +0.48%
==========================================
  Files         200      190      -10
  Lines        6109     5869     -240
==========================================
- Hits         5322     5141     -181
+ Misses        787      728      -59
OTEL_INTERNAL_LOG_ERROR(
    "[Periodic Exporting Metric Reader] Collect took longer configured time: "
    << static_cast<PeriodicExportingMetricReader *>(keep_lifetime.get())
           ->export_timeout_millis_.count()
nit: same as earlier, "do we need the typecast?"
keep_lifetime is std::shared_ptr<MetricReader>, but we need PeriodicExportingMetricReader* here.
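The downcast discussed in this thread can be sketched as follows. `Reader` and `PeriodicReader` are hypothetical stand-ins for `MetricReader` and `PeriodicExportingMetricReader`, not the SDK classes, and the 30000 ms default is an arbitrary value chosen for illustration. When the base class owns `enable_shared_from_this`, the captured pointer has the base type, so reaching a derived member needs either a raw `static_cast` or `std::static_pointer_cast`.

```cpp
#include <cassert>
#include <chrono>
#include <memory>

// Hypothetical stand-in for MetricReader: the base owns enable_shared_from_this.
struct Reader : std::enable_shared_from_this<Reader> {
  virtual ~Reader() = default;
};

// Hypothetical stand-in for PeriodicExportingMetricReader.
struct PeriodicReader : Reader {
  std::chrono::milliseconds export_timeout_millis_{30000};  // illustrative value
};

// Downcast on the raw pointer, in the style of the snippet under review.
long long TimeoutViaRawCast(const std::shared_ptr<Reader> &keep_lifetime) {
  return static_cast<PeriodicReader *>(keep_lifetime.get())
      ->export_timeout_millis_.count();
}

// Alternative: cast the shared_ptr itself, which also shares ownership.
long long TimeoutViaPointerCast(const std::shared_ptr<Reader> &keep_lifetime) {
  auto self = std::static_pointer_cast<PeriodicReader>(keep_lifetime);
  return self->export_timeout_millis_.count();
}
```

Both forms are only valid when the pointed-to object really is a `PeriodicReader`; neither performs a runtime check.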
Can we instead make PeriodicExportingMetricReader inherit from std::enable_shared_from_this<PeriodicExportingMetricReader>, and remove it from MetricReader, if this allows avoiding the typecast?
class PeriodicExportingMetricReader : public MetricReader, public std::enable_shared_from_this<PeriodicExportingMetricReader> {
// ... rest of the class ...
};
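A minimal sketch of how this suggestion avoids the cast, assuming hypothetical `Reader` and `PeriodicReader` types and a hypothetical `CollectAsync` method (none of these are the SDK's actual API). Because the derived class inherits `enable_shared_from_this` itself, `shared_from_this()` already yields a pointer of the derived type, and capturing it in the task keeps the object alive until the task finishes.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <memory>

// Hypothetical base class without enable_shared_from_this.
struct Reader {
  virtual ~Reader() = default;
};

class PeriodicReader : public Reader,
                       public std::enable_shared_from_this<PeriodicReader> {
public:
  explicit PeriodicReader(std::chrono::milliseconds timeout)
      : export_timeout_millis_(timeout) {}

  // Hypothetical method: launches a task that owns a reference to *this.
  std::future<long long> CollectAsync() {
    // shared_from_this() returns std::shared_ptr<PeriodicReader>, so no cast.
    auto self = shared_from_this();
    return std::async(std::launch::async, [self] {
      return static_cast<long long>(self->export_timeout_millis_.count());
    });
  }

private:
  std::chrono::milliseconds export_timeout_millis_;
};
```

The object must already be owned by a `std::shared_ptr` (for example via `std::make_shared`) before `shared_from_this()` is called; otherwise the call throws `std::bad_weak_ptr` in C++17 and later.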
Good point, it's a better solution for just a temporary fix.
Ok, so to clarify expectations:
Is the public std::enable_shared_from_this<PeriodicExportingMetricReader> meant to be temporary, until a better solution is found?
I do agree it fixes the immediate crash, so the code no longer keeps a pointer to a local variable in dead memory.
Approving, assuming this is a temporary fix.
@owent Can you please elaborate on the issue? As I understand it, we won't start a new collect if the previous one is still ongoing, so a timeout shouldn't create a new thread.
Thanks for the fix. Some nit comments, but in general looks good.
The coredump file of the crashed process in my application shows it has about 290+ threads.
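To illustrate the general concern about threads outliving a timeout, here is a small sketch that is not taken from the SDK. `DemoTimeout` and `TimeoutResult` are hypothetical names, and the gate future stands in for a slow collect or export. It shows that a timed-out `wait_for` on a `std::async(std::launch::async, ...)` future only stops the waiting; the task's thread keeps running until the work itself completes.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <future>

struct TimeoutResult {
  bool timed_out_while_running;  // wait_for gave up but the task was not done
  bool finished_after_release;   // the task completed once it was unblocked
};

TimeoutResult DemoTimeout() {
  std::promise<void> release;
  std::shared_future<void> gate = release.get_future().share();
  std::atomic<bool> finished{false};

  // The task blocks on the gate, simulating work that exceeds the timeout.
  auto task = std::async(std::launch::async, [gate, &finished] {
    gate.wait();
    finished = true;
  });

  // The caller gives up after 20 ms, but the task's thread is still alive.
  bool timed_out =
      task.wait_for(std::chrono::milliseconds(20)) == std::future_status::timeout;
  bool still_running = !finished.load();

  release.set_value();  // unblock the task
  task.get();           // join it before `finished` goes out of scope
  return {timed_out && still_running, finished.load()};
}
```

If a caller started a fresh task on every cycle while earlier ones were still blocked, the threads would accumulate; whether the reader can actually reach that state is the question raised in this thread.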
Thanks for the fix. LGTM with nit comments.
Thanks for the PR.
See comments for a proposed different implementation.
Marking as "request changes" for now to discuss it; I will revise the review if it turns out not to be practical.
To add, my approval assumes this PR fixes the crash by keeping PeriodicExportingMetricReader alive during async ops, and that the thread explosion during timeouts will be tracked and addressed separately.
LGTM, thanks for the fix.
Assuming this is a temporary solution just to prevent the crash.
CI for this PR is consistently stuck when executing the "exporter proto" CI unit tests.
All tests are executed, but the workflow does not stop.
I suspect that some thread is still blocked, either in the unit tests or in the exporter itself, which prevents the process from stopping cleanly.
This happens only for this PR, so it does not look like a general issue.
Please investigate.
Even if some code needs to wait on something that never happens, it is desirable to wait with a timeout and retry, printing something in the internal log in each loop, so the code is easier to troubleshoot if this happens again.
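The wait-with-timeout-and-retry pattern suggested above could look roughly like this. `WaitWithRetry` is a hypothetical helper, and `std::cerr` stands in for whatever internal logging the real code would use; the interval and round limit are illustrative parameters.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>

// Waits until `done` becomes true, waking up every `interval` to log progress.
// Returns the number of timed-out rounds before success, or -1 after
// `max_rounds` rounds without success. `done` must be guarded by `m`.
int WaitWithRetry(std::mutex &m, std::condition_variable &cv, const bool &done,
                  std::chrono::milliseconds interval, int max_rounds) {
  std::unique_lock<std::mutex> lock(m);
  for (int round = 0; round < max_rounds; ++round) {
    // wait_for with a predicate returns the predicate's value on wake-up or timeout.
    if (cv.wait_for(lock, interval, [&done] { return done; })) {
      return round;
    }
    // One log line per timed-out round makes a stuck wait visible.
    std::cerr << "[wait] condition not met after round " << (round + 1) << "\n";
  }
  return -1;
}
```

A bounded wait like this turns a silent hang into a series of log lines, which is the troubleshooting aid the comment asks for.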
Sorry, I'm too busy these days. I will continue this some time later.
I used another implementation to avoid some bugs in the STL. Please review it again.
LGTM, thanks for the fix.
Thanks also for the extra cleanup for include-what-you-use.
Fixes #2982
Changes
Keep PeriodicExportingMetricReader and a local variable alive when calling std::async.
@lalitb @marcalff Just wondering why we use std::async(std::launch::async, ...) here? It will create a lot of threads when timeouts occur frequently, which uses a lot of resources and slows down the whole application.
For significant contributions please make sure you have completed the following items:
CHANGELOG.md updated for non-trivial changes