Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Heartbeat] improve monitor state loader failure attempts #39621

Merged

Conversation

vigneshshanmugam
Copy link
Member

Description

PR addresses couple of things on the state loader

  • We were waiting unnecessary for more than 4 seconds after all of the loading attempts.
  • Failure attempts were too frequent like the connection was reestablished in 0, 135 ms, 1200ms, 4300 ms, Now with the change, we will start from 0, 1200ms, 4300ms and exit with error code.
  • Improve the logs around the last attempt and keep the message consistent.

How to test this PR locally

  • Create a heartbeat browser/lightweight monitor
  • Spin up a local ES
  • Close the local ES once the initial connection gets established
  • Wait for the monitor to get finished, then check the attempts.

@vigneshshanmugam vigneshshanmugam requested a review from a team as a code owner May 18, 2024 00:16
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label May 18, 2024
Copy link
Contributor

mergify bot commented May 18, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @vigneshshanmugam? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@vigneshshanmugam vigneshshanmugam added bug Team:obs-ds-hosted-services Label for the Observability Hosted Services team labels May 18, 2024
@elasticmachine
Copy link
Collaborator

Pinging @elastic/obs-ds-hosted-services (Team:obs-ds-hosted-services)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label May 18, 2024
@vigneshshanmugam vigneshshanmugam added needs_team Indicates that the issue/PR needs a Team:* label backport-v8.14.0 Automated backport with mergify labels May 18, 2024
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label May 18, 2024
@vigneshshanmugam vigneshshanmugam requested a review from a team as a code owner May 18, 2024 06:20
@dliappis
Copy link
Contributor

dliappis commented May 20, 2024

Hey @vigneshshanmugam ,

Is this ready for review? I can confirm that the failures bubbled up by CI are reproducible and don't exist e.g. on the regular runs on main https://buildkite.com/elastic/heartbeat/builds/5135
I was also able to also reproduce the hanging (Go) unit tests:


=== Failed
=== FAIL: heartbeat/monitors TestMonitorBasic (1.00s)
    monitor_test.go:103: No publishes detected!

=== FAIL: heartbeat/monitors TestMonitorCfgError (1.00s)
    monitor_test.go:103: No publishes detected!

=== FAIL: heartbeat/monitors/active/http TestUpStatuses (unknown)


=== FAIL: heartbeat/monitors/active/http TestUpStatuses/Test_OK_HTTP_status_208_using_hosts_config_field (unknown)
signal: interrupt

which I suppose is the reason why you tried adding an explicit timeout in the Buildkite steps.

@vigneshshanmugam
Copy link
Member Author

@dliappis Thanks for the link on the main branch, there was a bug in the code that caused the timeouts and made me think increasing would help. Should be fixed in the next run 👍🏽

Copy link
Collaborator

@emilioalvap emilioalvap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@dliappis dliappis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (all CI is green, no changes related to ingest-eng-prod owned codebase)

@vigneshshanmugam vigneshshanmugam merged commit d5bfebb into elastic:main May 24, 2024
19 checks passed
@vigneshshanmugam vigneshshanmugam deleted the improve-state-loader-msg branch May 24, 2024 14:59
mergify bot pushed a commit that referenced this pull request May 24, 2024
* [Heartbeat] improve state loader failure logs

* try increasing timeouts

* exit when there is no error

* add state loader id

(cherry picked from commit d5bfebb)
vigneshshanmugam added a commit that referenced this pull request May 27, 2024
…39729)

* [Heartbeat] improve state loader failure logs

* try increasing timeouts

* exit when there is no error

* add state loader id

(cherry picked from commit d5bfebb)

Co-authored-by: Vignesh Shanmugam <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-v8.14.0 Automated backport with mergify bug Team:obs-ds-hosted-services Label for the Observability Hosted Services team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants