
Add SafeProcessor wrapper #34647

Merged: 5 commits merged into elastic:main from fix-add-kubernetes-metadata-panic on Feb 24, 2023

Conversation

rdner
Member

@rdner rdner commented Feb 22, 2023

What does this PR do?

Each processor might end up in multiple processor groups. Every group has its own Close that calls Close on each processor of that group, which leads to multiple Close calls on the same processor.

Without the SafeProcessor wrapper, every processor would have to handle repeated Close calls itself by adding sync.Once to its Close function.

We make it easier for processor developers and take care of it in the processor registry instead.
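Conceptually, the wrapper needs only an atomic "closed" flag: Close forwards to the wrapped processor a single time, and Run rejects events afterwards. A minimal sketch of the idea in Go (the `Processor` interface below is a simplified stand-in, not libbeat's actual signatures):

```go
package processors

import (
	"errors"
	"sync/atomic"
)

// ErrClosed is returned by Run once the processor has been closed.
var ErrClosed = errors.New("attempt to use a closed processor")

// Processor is a simplified stand-in for the libbeat processor
// interface: Run handles one event, Close releases resources.
type Processor interface {
	Run(event any) (any, error)
	Close() error
}

// SafeProcessor makes Close idempotent and rejects Run calls that
// arrive after Close, so individual processors don't need sync.Once.
type SafeProcessor struct {
	Processor
	closed atomic.Bool
}

// Run refuses to process events once the processor is closed.
func (p *SafeProcessor) Run(event any) (any, error) {
	if p.closed.Load() {
		return nil, ErrClosed
	}
	return p.Processor.Run(event)
}

// Close forwards to the wrapped processor exactly once; any later
// call, e.g. from another processor group, is a no-op.
func (p *SafeProcessor) Close() error {
	if p.closed.CompareAndSwap(false, true) {
		return p.Processor.Close()
	}
	return nil
}
```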

Why is it important?

Our customers could see panics in their logs, which is noisy.

Also, there are reports of higher impact.

Checklist

- [x] My code follows the style guidelines of this project
- [x] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] I have made corresponding changes to the default configuration files
- [x] I have added tests that prove my fix is effective or that my feature works
- [x] I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Unfortunately, it's very difficult to reproduce the panic described in the issue; it seems to be the result of a race condition when Beats shuts down.

The change is covered by unit tests, and I made sure that a simple processor with the following config still works and there is no regression:

filebeat.inputs:
  - type: filestream
    id: my-filestream-id
    enabled: true
    paths:
      - "metadata-panic/logs/log*.ndjson"
    processors:
      - add_fields:
          target: project
          fields:
            name: myproject
            id: 'UNIQUE_VALUE'

path.data: "metadata-panic/data"
logging:
  level: debug
output.console:
  enabled: true

and I got this event published:

{
  "@timestamp": "2023-02-23T10:33:22.673Z",
  "@metadata": {
    "beat": "filebeat",
    "type": "_doc",
    "version": "8.8.0"
  },
  "log": {
    "offset": 32170800,
    "file": {
      "path": "metadata-panic/logs/log1.ndjson"
    }
  },
  "message": "{\"data\": \"sample\"}",
  "input": {
    "type": "filestream"
  },
  "project": {
    "name": "myproject",
    "id": "UNIQUE_VALUE"
  },
  "agent": {
    "ephemeral_id": "926e973d-f35b-4f1e-95aa-6b9cf239e21c",
    "id": "2f6fa2ce-1a37-4259-91e8-1df581639d07",
    "name": "MacBook-Pro.localdomain",
    "type": "filebeat",
    "version": "8.8.0"
  },
  "ecs": {
    "version": "8.0.0"
  },
  "host": {
    "name": "MacBook-Pro.localdomain"
  }
}

Note the fields added by the add_fields processor:

"project": {
    "name": "myproject",
    "id": "UNIQUE_VALUE"
  }

Related issues

  • add_kubernetes_metadata panics when a Beat shuts down

@rdner rdner self-assigned this Feb 22, 2023
@botelastic botelastic bot added the needs_team label (indicates that the issue/PR needs a Team:* label) Feb 22, 2023
@rdner rdner added the bug and bugfix labels and removed the needs_team label Feb 22, 2023
@botelastic botelastic bot added the needs_team label (indicates that the issue/PR needs a Team:* label) Feb 22, 2023
@mergify
Contributor

mergify bot commented Feb 22, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @rdner? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8.\d.0 is the label to automatically backport to the 8.\d branch (\d is the digit)

@elasticmachine
Collaborator

elasticmachine commented Feb 22, 2023

💚 Build Succeeded



Build stats

  • Start Time: 2023-02-24T08:20:13.190+0000

  • Duration: 66 min 5 sec

Test stats 🧪

  • Failed: 0
  • Passed: 25993
  • Skipped: 1962
  • Total: 27955

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@rdner rdner added the backport-7.17 (automated backport to the 7.17 branch with mergify) and backport-v8.7.0 (automated backport with mergify) labels Feb 23, 2023
@rdner rdner added the Team:Elastic-Agent-Data-Plane label (label for the Agent Data Plane team) Feb 23, 2023
@botelastic botelastic bot removed the needs_team label Feb 23, 2023
@rdner rdner marked this pull request as ready for review February 23, 2023 12:24
@rdner rdner requested a review from a team as a code owner February 23, 2023 12:24
@rdner rdner requested review from belimawr and leehinman and removed request for a team February 23, 2023 12:24
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@rdner rdner requested a review from cmacknz February 23, 2023 12:24
@cmacknz cmacknz added the backport-v8.6.0 label (automated backport with mergify) Feb 23, 2023
@rdner
Member Author

rdner commented Feb 24, 2023

@cmacknz there will be no 8.6 releases anymore; I don't think we need that backport label.

@rdner rdner requested a review from cmacknz February 24, 2023 14:49
Contributor

@leehinman leehinman left a comment


Code LGTM.

One optional nit on the name SafeProcessor. The name doesn't tell me what it is safe from, so if you just encountered this in registry.go you wouldn't have many clues as to why the constructor is being wrapped. Names like CloseOnce or AtomicCloser might be better.

@rdner
Member Author

rdner commented Feb 24, 2023

@leehinman the name comes from the fact that it addresses not just the Close function call but also Run: you can't run a processor that was closed before and cause panics by doing so.

I've put a lot of thought into the naming and I'd rather keep it this way, so we could perhaps extend the functionality of this wrapper in the future, e.g. blocking certain events from being processed by any processor, etc.
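For illustration, with a wrapper like the sketch in the description above, calling Close twice is harmless and Run after Close fails with an error instead of panicking (`noopProcessor` here is a hypothetical stand-in, and `fmt` is assumed imported):

```go
// noopProcessor is a hypothetical processor used only for this demo.
type noopProcessor struct{}

func (noopProcessor) Run(event any) (any, error) { return event, nil }
func (noopProcessor) Close() error               { return nil }

func ExampleSafeProcessor() {
	p := &SafeProcessor{Processor: noopProcessor{}}

	_ = p.Close() // first Close reaches the wrapped processor
	_ = p.Close() // second Close is a harmless no-op

	// Running a closed processor returns an error instead of panicking.
	if _, err := p.Run("event"); err != nil {
		fmt.Println(err) // prints: attempt to use a closed processor
	}
}
```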

@rdner rdner merged commit 5007b58 into elastic:main Feb 24, 2023
@rdner rdner deleted the fix-add-kubernetes-metadata-panic branch February 24, 2023 16:03
@rdner rdner removed the backport-v8.6.0 label (automated backport with mergify) Feb 24, 2023
mergify bot pushed a commit that referenced this pull request Feb 24, 2023
(cherry picked from commit 5007b58)
mergify bot pushed a commit that referenced this pull request Feb 24, 2023
(cherry picked from commit 5007b58)
mergify bot pushed a commit that referenced this pull request Feb 24, 2023
(cherry picked from commit 5007b58)
rdner added a commit that referenced this pull request Feb 24, 2023
* Add `SafeProcessor` wrapper (#34647)

(cherry picked from commit 5007b58)

* Remove extra changelog entries

---------

Co-authored-by: Denis <[email protected]>
rdner added a commit that referenced this pull request Feb 24, 2023
* Add `SafeProcessor` wrapper (#34647)

(cherry picked from commit 5007b58)

* Remove extra changelog entries

* Fix different constructor signature

---------

Co-authored-by: Denis <[email protected]>
rdner added a commit that referenced this pull request Feb 27, 2023
(cherry picked from commit 5007b58)

Co-authored-by: Denis <[email protected]>
rdner added a commit to rdner/beats that referenced this pull request Mar 7, 2023
After introducing the `SafeProcessor` wrapper in
elastic#34647, we started returning
errors when a processor is used after its `Close` function has
been called.

This led to dropped events and error spam in the logs, but it also
confirmed that the root cause of the problem was not just a race
condition on `Close` but processors being re-used somewhere.

After a long investigation, the code that was re-using processors
was finally found.

This change stops re-using the processors and instantiates them on
each input restart.
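For illustration only, the shape of that fix could look like the sketch below: keep the processor config rather than live instances, and build fresh processors on every (re)start. `Runner`, `ProcessorConfig`, and `newProcessors` are hypothetical names, not the actual Filebeat code:

```go
// ProcessorConfig and newProcessors are illustrative stand-ins for
// the real config types and processor construction in Beats.
type ProcessorConfig struct{ /* parsed `processors:` section */ }

func newProcessors(cfg ProcessorConfig) ([]Processor, error) {
	// Would build the configured processor chain from scratch.
	return nil, nil
}

// Runner keeps the processor config rather than live instances, so
// every (re)start builds a fresh, never-closed processor chain.
type Runner struct {
	cfg   ProcessorConfig
	procs []Processor
}

func (r *Runner) Start() error {
	procs, err := newProcessors(r.cfg) // fresh instances on each restart
	if err != nil {
		return err
	}
	r.procs = procs
	return nil
}

func (r *Runner) Stop() {
	for _, p := range r.procs {
		_ = p.Close() // idempotent when wrapped in SafeProcessor
	}
	r.procs = nil
}
```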
cmacknz pushed a commit that referenced this pull request Mar 7, 2023
* Stop re-using processors defined in the config

* Fix linter issues

* Add changelog entry
mergify bot pushed a commit that referenced this pull request Mar 7, 2023
* Stop re-using processors defined in the config

* Fix linter issues

* Add changelog entry

(cherry picked from commit 5cfe62c)
mergify bot pushed a commit that referenced this pull request Mar 7, 2023
* Stop re-using processors defined in the config

* Fix linter issues

* Add changelog entry

(cherry picked from commit 5cfe62c)
mergify bot pushed a commit that referenced this pull request Mar 7, 2023
* Stop re-using processors defined in the config

* Fix linter issues

* Add changelog entry

(cherry picked from commit 5cfe62c)

# Conflicts:
#	filebeat/channel/runner.go
#	libbeat/processors/safe_processor.go
cmacknz pushed a commit that referenced this pull request Mar 8, 2023
* Stop re-using processors defined in the config

* Fix linter issues

* Add changelog entry

(cherry picked from commit 5cfe62c)

Co-authored-by: Denis <[email protected]>
cmacknz pushed a commit that referenced this pull request Mar 8, 2023
* Stop re-using processors defined in the config

* Fix linter issues

* Add changelog entry

(cherry picked from commit 5cfe62c)

Co-authored-by: Denis <[email protected]>
rdner added a commit that referenced this pull request Mar 8, 2023
* Stop re-using processors defined in the config (#34761) (#34764)
* Fix linter issues

* Add changelog entry

(cherry picked from commit 5cfe62c)

# Conflicts:
#	filebeat/channel/runner.go
#	libbeat/processors/safe_processor.go

* Resolve conflicts, fix changelog

* Add new line to changelog

* Revert comment auto-formatting

---------

Co-authored-by: Denis <[email protected]>
chrisberkhout pushed a commit that referenced this pull request Jun 1, 2023
Add `SafeProcessor` wrapper (#34647)
chrisberkhout pushed a commit that referenced this pull request Jun 1, 2023
* Stop re-using processors defined in the config

* Fix linter issues

* Add changelog entry
Labels

backport-7.17 · backport-v8.7.0 · bug · bugfix · Team:Elastic-Agent-Data-Plane
Development

Successfully merging this pull request may close these issues.

add_kubernetes_metadata panics when a Beat shuts down
4 participants