Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make event acknowledgment asynchronous in shipper output #32785

Merged
merged 7 commits into from
Aug 31, 2022

Conversation

rdner
Copy link
Member

@rdner rdner commented Aug 23, 2022

What does this PR do?

Implements an asynchronous approach for acknowledgment of event batches replacing the previous blocking approach.

Why is it important?

So, the event pipeline is not blocked because of a single batch and keeps publishing events to the shipper.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
    - [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
    - [ ] I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Related issues

So we can keep publishing batches not blocking on a single batch to be acknowledged.
@rdner rdner self-assigned this Aug 23, 2022
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Aug 23, 2022
@mergify
Copy link
Contributor

mergify bot commented Aug 23, 2022

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @rdner? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@rdner rdner added the v8.5.0 label Aug 23, 2022
libbeat/outputs/shipper/api/shipper_mock.go Show resolved Hide resolved
libbeat/outputs/shipper/api/shipper_mock.go Show resolved Hide resolved
libbeat/outputs/shipper/shipper.go Show resolved Hide resolved
libbeat/outputs/shipper/shipper.go Show resolved Hide resolved
libbeat/outputs/shipper/shipper.go Show resolved Hide resolved
libbeat/outputs/shipper/shipper.go Outdated Show resolved Hide resolved
@rdner rdner added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Aug 23, 2022
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Aug 23, 2022
@rdner rdner marked this pull request as ready for review August 23, 2022 17:07
@rdner rdner requested a review from a team as a code owner August 23, 2022 17:07
@rdner rdner requested review from cmacknz and fearful-symmetry and removed request for a team August 23, 2022 17:07
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@rdner rdner requested a review from faec August 23, 2022 17:08
@elasticmachine
Copy link
Collaborator

elasticmachine commented Aug 23, 2022

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2022-08-30T09:39:20.960+0000

  • Duration: 86 min 16 sec

Test stats 🧪

Test Results
Failed 0
Passed 22747
Skipped 1947
Total 24694

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

libbeat/outputs/shipper/shipper.go Outdated Show resolved Hide resolved
libbeat/outputs/shipper/shipper.go Outdated Show resolved Hide resolved
s.pendingMutex.Lock()
lastProcessed := 0
for _, p := range s.pending {
if p.serverID != indexReply.Uuid {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going along with the initialization comment above: the cleanup from a mismatched uuid should happen in Publish and/or Close -- the uuid will never change during an active connection, so after the first iteration this would just be caught by the err != nil check above and none of the outstanding batches would be cancelled. (It might also be nice for readability to move this into a standalone cancelAllBatches helper.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my previous comment #32785 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, but in that case can we move the cleanup so it happens in Connect when we set s.serverID? It will still never change over the course of a connection, so this check could be skipped -- keeping one-time initialization separate makes the logic of ackLoop clearer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to trying to trying to keep the cleanup in Connect to simplify the rest of the logic.

Copy link
Member Author

@rdner rdner Aug 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not going to simplify the rest of the logic, in fact it's going to add one more lock for the pending queue in a different place that can possibly cause a deadlock with the client lock. Also, I would have to copy items in two places. It's more robust to keep it the way it is and I don't see any reason to move it. Unless there is a good argument why the current state of the code does not work as intended I'm going to keep it.

TLDR; I think having only one place where we lock and mutate this pending slice is cleaner and safer. And moving this code gains no obvious benefit.

@rdner
Copy link
Member Author

rdner commented Aug 29, 2022

CI failures (metricbeat) are unrelated to changes in the PR.

@rdner rdner requested review from faec and leehinman August 29, 2022 15:23
Copy link
Contributor

@leehinman leehinman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

libbeat/outputs/shipper/shipper.go Outdated Show resolved Hide resolved
s.pendingMutex.Lock()
lastProcessed := 0
for _, p := range s.pending {
if p.serverID != indexReply.Uuid {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, but in that case can we move the cleanup so it happens in Connect when we set s.serverID? It will still never change over the course of a connection, so this check could be skipped -- keeping one-time initialization separate makes the logic of ackLoop clearer.

Copy link
Member

@cmacknz cmacknz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spotted one small issue (I think), otherwise LGTM.

@rdner rdner requested review from cmacknz and faec August 30, 2022 11:06
@rdner rdner merged commit ec2ed6a into elastic:main Aug 31, 2022
@rdner rdner deleted the async-shipper-ack branch August 31, 2022 06:52
v1v added a commit to v1v/beats that referenced this pull request Sep 1, 2022
…ackaging

* upstream/main: (109 commits)
  Add cap_net_raw requirements to heartbeat docs (elastic#32816)
  apply a quick hotfix for having main working properly (elastic#32934)
  action: checks for x-pack/libbeat and libbeat (elastic#32754)
  Update to Go 1.18 in go.mod. (elastic#32940)
  [heartbeat] disable browser code on windows via build tags (elastic#32939)
  action: checks for heartbeat and x-pack/heartbeat (elastic#32749)
  Make event acknowledgment asynchronous in shipper output (elastic#32785)
  [Automation] Update elastic stack version to 8.5.0-fedc3e60 for testing (elastic#32930)
  Preallocate memory to reduce GC load (elastic#32905)
  [Automation] Update elastic stack version to 8.5.0-440e0896 for testing (elastic#32919)
  Skip broken ceph tests. (elastic#32912)
  Use non-deprecated docker image for testing jolokia (elastic#32885)
  update ironbank image product name (elastic#32867)
  ci: pre-commit stage within Jenkins (elastic#32839)
  Fix a couple of bugs in the logic for how AWS metric periods are calculated (elastic#32724)
  [Filebeat] [httpjson] Add support for single string containing multiple relation-types in getRFC5988Link (elastic#32811)
  [Heartbeat] Update HB k8s template to use <Mi> metric (elastic#32801)
  action: checks for metricbeat and x-pack/metricbeat (elastic#32748)
  action: checks for filebeat and x-pack/filebeat (elastic#32746)
  allow for json/ndjson content type with charset (elastic#32767)
  ...
chrisberkhout pushed a commit that referenced this pull request Jun 1, 2023
So we can keep publishing batches not blocking on a single batch to be acknowledged.
Also updated the config documentation.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-skip Skip notification from the automated backport with mergify Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team v8.5.0
Projects
None yet
5 participants