Automate status reporting on start #8836

mwear · 2023-11-09T19:24:55Z

Description:
This is part of the continued component status reporting effort. Currently, status reporting is automated for the following component lifecycle events: Starting, Stopping, and Stopped, as well as for definitive errors that occur while starting or stopping (e.g. as determined by an error return value). This leaves the component responsible for reporting runtime status after start and before stop. We'd like to extend the automatic status reporting to report StatusOK if Start completes without an error. One complication with this approach is that some components spawn async work (via goroutines) that, depending on the Go scheduler, can report status before Start returns. As such, we cannot assume that a nil return value from Start means the component has started properly. The solution is to detect whether the component has already reported status when Start returns. If it has, we use the component-reported status and do not automatically report status. If it hasn't, and Start returns without an error, we report StatusOK. Any subsequent reports from the component (async or otherwise) will transition the component status accordingly.

The tl;dr is that we cannot control the execution of async code, since that is up to the Go scheduler. We can, however, handle the race, report the status based on the actual execution, and avoid clobbering status reported from within the component during the startup process. That said, for components with async starts, you may see a StatusOK before the component-reported status, or just the component-reported status, depending on the actual execution of the code. In both cases, the end status will be the same.

The work in this PR will allow us to simplify #8684 and #8788 and ultimately choose which direction we want to go for runtime status reporting.

Link to tracking Issue: #7682

Testing: units / manual

jmacd

Looks good to me. Thanks for the detailed PR description @mwear.

codecov · 2023-11-09T22:41:46Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (df6448b) 90.83% compared to head (8ebee4d) 91.57%.
Report is 59 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #8836      +/-   ##
==========================================
+ Coverage   90.83%   91.57%   +0.74%     
==========================================
  Files         318      316       -2     
  Lines       17199    17147      -52     
==========================================
+ Hits        15622    15702      +80     
+ Misses       1284     1150     -134     
- Partials      293      295       +2


mwear · 2023-11-09T23:29:53Z

The test failures in contrib are due to embedding TelemetrySettingsBase in TelemetrySettings. This PR, as is, will require some changes there. If that's not an option, I can do away with TelemetrySettingsBase altogether and duplicate code between component.TelemetrySettings and servicetelemetry.TelemetrySettings.

Edit: I went ahead and removed TelemetrySettingsBase and duplicated the common code between the component and service TelemetrySettings structs. Contrib tests are fine after this change.

The flexibility of ReportComponentStatusIf invites misuse if the API is not fully understood. In addition, we only use it for the specific case of conditionally reporting StatusOK if a component's current status is Starting. This commit replaces ReportComponentStatusIf with ReportComponentOKIfStarting, which fulfills the requirements without the potential for misuse.
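To illustrate the design choice in that commit message, here is a small sketch contrasting a predicate-based conditional report with a narrow, single-purpose one. The signatures are hypothetical and are not the actual ReportComponentStatusIf or ReportComponentOKIfStarting signatures from this PR; `reporter`, `reportIf`, and `reportOKIfStarting` are names invented for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// Status is a simplified, hypothetical stand-in for component status.
type Status int

const (
	StatusStarting Status = iota
	StatusOK
	StatusRecoverableError
)

// reporter holds the current status of one component (hypothetical type).
type reporter struct {
	mu      sync.Mutex
	current Status
}

// reportIf is the flexible shape: the caller supplies any condition.
// It returns true if the status was changed.
func (r *reporter) reportIf(next Status, cond func(current Status) bool) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if !cond(r.current) {
		return false
	}
	r.current = next
	return true
}

// reportOKIfStarting is the narrow shape: the only allowed transition is
// Starting -> OK, so callers cannot pick an unsafe condition.
func (r *reporter) reportOKIfStarting() bool {
	return r.reportIf(StatusOK, func(c Status) bool { return c == StatusStarting })
}

func main() {
	// Misuse of the flexible API: an always-true condition overwrites a
	// status that the component reported itself.
	a := &reporter{current: StatusRecoverableError}
	changed := a.reportIf(StatusOK, func(Status) bool { return true })
	fmt.Println("flexible API changed status:", changed)

	// The narrow API leaves the component-reported status alone.
	b := &reporter{current: StatusRecoverableError}
	changed = b.reportOKIfStarting()
	fmt.Println("narrow API changed status:", changed)
}
```

The narrow method encodes the one condition the service actually needs, so a caller cannot accidentally supply a condition that overwrites a status reported by the component.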

Co-authored-by: Alex Boten

mwear requested review from a team and bogdandrutu November 9, 2023 19:24

mwear force-pushed the automated-status-on-start branch 2 times, most recently from e5a06df to 1415fc5 Compare November 9, 2023 19:58

jmacd approved these changes Nov 9, 2023

mwear force-pushed the automated-status-on-start branch 2 times, most recently from 641b268 to 34ef989 Compare November 9, 2023 22:37

mwear added 9 commits November 14, 2023 17:50

Refactor ServiceStatusFunc into a struct; add ReportComponentStatusIf method

0aaf0f4

Add status.Reporter to servicetelemetry.TelemetrySettings

f2ee5b9

Report StatusOK on successful start

a9bc0d5

Return last error for wrapped ReportComponentStatus methods for SharedComponent

11d7230

This fixes an existing bug

Simplify reporter

7d79c63

One mutex is sufficient

Add changelog

f9247e6

Cleanup

a48ac79

Embed TelemetrySettingBase as a value

78a38f9

This is mainly because it has a better zero value that requires fewer modifications to existing code.

Remove TelemetrySettingsBase; duplicate shared code instead

d1fd02a

Embedding TelemetrySettingsBase is a bit of a pain and is causing failures in contrib. The path of least resistance is to duplicate the code shared between the component and service TelemetrySettings.

mwear force-pushed the automated-status-on-start branch from 34ef989 to d1fd02a Compare November 15, 2023 02:07

djaglowski reviewed Nov 15, 2023

service/internal/graph/graph.go (outdated, resolved)

mwear force-pushed the automated-status-on-start branch from 8fc1031 to 242ef9a Compare November 16, 2023 18:17

mwear force-pushed the automated-status-on-start branch from 242ef9a to 24957ad Compare November 16, 2023 18:26

djaglowski approved these changes Nov 16, 2023

codeboten reviewed Nov 16, 2023

component/telemetry.go (resolved)

service/service.go (resolved)

Retain and deprecate TelemetrySettingsBase

eb03379

This is public API, so we need to follow the deprecation process.

codeboten approved these changes Nov 28, 2023

component/telemetry.go (outdated, resolved)

Update component/telemetry.go

8ebee4d

codeboten merged commit 433f7ae into open-telemetry:main Nov 28, 2023
32 checks passed

github-actions bot added this to the next release milestone Nov 28, 2023

crobert-1 mentioned this pull request Jan 26, 2024

Kafka receiver stuck while shutting down at v0.93.0 open-telemetry/opentelemetry-collector-contrib#30789

Closed
