Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automate status reporting on start #8836

Merged
merged 12 commits into from
Nov 28, 2023

Conversation

mwear
Copy link
Member

@mwear mwear commented Nov 9, 2023

Description:
This is part of the continued component status reporting effort. Currently we have automated status reporting for the following component lifecycle events: Starting, Stopping, Stopped as well as definitive errors that occur in the starting or stopping process (e.g. as determined by an error return value). This leaves the responsibility to the component to report runtime status after start and before stop. We'd like to be able to extend the automatic status reporting to report StatusOK if Start completes without an error. One complication with this approach is that some components spawn async work (via goroutines) that, depending on the Go scheduler, can report status before Start returns. As such, we cannot assume a nil return value from Start means the component has started properly. The solution is to detect if the component has already reported status when start returns, if it has, we will use the component-reported status and will not automatically report status. If it hasn't, and Start returns without an error, we can report StatusOK. Any subsequent reports from the component (async or otherwise) will transition the component status accordingly.

The tl;dr is that we cannot control the execution of async code, that's up to the Go scheduler, but we can handle the race, report the status based on the execution, and not clobber status reported from within the component during the startup process. That said, for components with async starts, you may see a StatusOK before the component-reported status, or just the component-reported status depending on the actual execution of the code. In both cases, the end status will be same.

The work in this PR will allow us to simplify #8684 and #8788 and ultimately choose which direction we want to go for runtime status reporting.

Link to tracking Issue: #7682

Testing: units / manual

@mwear mwear requested review from a team and bogdandrutu November 9, 2023 19:24
@mwear mwear force-pushed the automated-status-on-start branch 2 times, most recently from e5a06df to 1415fc5 Compare November 9, 2023 19:58
Copy link
Contributor

@jmacd jmacd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thanks for the detailed PR description @mwear.

@mwear mwear force-pushed the automated-status-on-start branch 2 times, most recently from 641b268 to 34ef989 Compare November 9, 2023 22:37
Copy link

codecov bot commented Nov 9, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (df6448b) 90.83% compared to head (8ebee4d) 91.57%.
Report is 59 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8836      +/-   ##
==========================================
+ Coverage   90.83%   91.57%   +0.74%     
==========================================
  Files         318      316       -2     
  Lines       17199    17147      -52     
==========================================
+ Hits        15622    15702      +80     
+ Misses       1284     1150     -134     
- Partials      293      295       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@mwear
Copy link
Member Author

mwear commented Nov 9, 2023

The test failures in contrib are due to embedding TelemetrySettingsBase in TelemetrySettings. This PR as-is, will require some changes there. If that's not an option, I can do away with TelemetrySettingsBase altogether and duplicate code between component.TelemetrySettings and servicetelemetry.TelemetrySettings.

Edit: I went ahead and removed TelemetrySettingsBase and duplicated the common code between the component and service TelemetrySettings structs. Contrib tests are fine after this change.

One mutex is sufficient
This is mainly because it has a better zero value that requires fewer
modifications to existing code.
Embedding TelemetrySettingsBase is a bit of a pain and is causing
failures in contrib. The path of least resistance is to duplicate the
code shared between the component and service TelemetrySettings.
The flexibility of ReportComponentStatusIf invites misuse if the API
is not fully understood. In addition, we only use for the specific case
of conditionally reporting StatusOK if a component's current status is
Starting. This commit replaces ReportComponentStatusIf with
ReportComponentOKIfStarting which fulfills the requirements without
the potential for misuse.
component/telemetry.go Show resolved Hide resolved
service/service.go Show resolved Hide resolved
This is public API, so we need to follow the deprecation process.
component/telemetry.go Outdated Show resolved Hide resolved
@codeboten codeboten merged commit 433f7ae into open-telemetry:main Nov 28, 2023
32 checks passed
@github-actions github-actions bot added this to the next release milestone Nov 28, 2023
pantuza pushed a commit to pantuza/opentelemetry-collector that referenced this pull request Dec 8, 2023
This is part of the continued component status reporting effort.
Currently we have automated status reporting for the following component
lifecycle events: `Starting`, `Stopping`, `Stopped` as well as
definitive errors that occur in the starting or stopping process (e.g.
as determined by an error return value). This leaves the responsibility
to the component to report runtime status after start and before stop.
We'd like to be able to extend the automatic status reporting to report
`StatusOK` if `Start` completes without an error. One complication with
this approach is that some components spawn async work (via goroutines)
that, depending on the Go scheduler, can report status before `Start`
returns. As such, we cannot assume a nil return value from `Start` means
the component has started properly. The solution is to detect if the
component has already reported status when start returns, if it has, we
will use the component-reported status and will not automatically report
status. If it hasn't, and `Start` returns without an error, we can
report `StatusOK`. Any subsequent reports from the component (async or
otherwise) will transition the component status accordingly.

The tl;dr is that we cannot control the execution of async code, that's
up to the Go scheduler, but we can handle the race, report the status
based on the execution, and not clobber status reported from within the
component during the startup process. That said, for components with
async starts, you may see a `StatusOK` before the component-reported
status, or just the component-reported status depending on the actual
execution of the code. In both cases, the end status will be same.

The work in this PR will allow us to simplify open-telemetry#8684 and open-telemetry#8788 and
ultimately choose which direction we want to go for runtime status
reporting.

**Link to tracking Issue:** open-telemetry#7682

**Testing:** units / manual

---------

Co-authored-by: Alex Boten <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants