Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix panic when more than 32767 pipeline clients are active #38556

Merged
merged 9 commits into from
Apr 12, 2024

Conversation

belimawr
Copy link
Contributor

@belimawr belimawr commented Mar 22, 2024

Proposed commit message

The publishing pipeline would panic when more than 32767 clients were active, that happened because each client would add two channels to a slice and an infinity loop would use reflect.Select on this list. reflect.Select supports a maximum of 65536 cases, if there are more it panics.

This PR fixes this by removing the need for this list. The pipeline used an infinity loop to detect if the CloseRef from the pipeline client was closed, if it happened, then it would call client.Close. By analysing the code setting CloseRef we identified there was no need for the pipeline to be responsible for propagating this signal, most of the times the pipeline client was created, there was already a defer to close it, in the few places where it was not present we added one.

Now the pipeline is not responsible for closing any client, whomever creates a client is responsible for correctly closing it.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

## Author's Checklist

How to test this PR locally

This PR fixes a bug by removing a feature that is not needed. That said, it is easy to reproduce the situation that would lead to the crash.

All configuration files and scripts are at the end of this section.

  1. Create at least 33.000 files, you can use ./gen-logs.sh 33000
  2. Start Filebeat with Filestream input reading those files. Use filebeat.yml
  3. Wait until all events are ingested, you can do this by checking the number of lines in the output file: wc -l output*.ndjson, it should report 33.000.000 entries.
  4. Ensure Filebeat does not panic.
  5. Remove all created log files from your system: rm -rf /tmp/too-many-logs
filebeat.yml

filebeat.inputs:
  - type: filestream
    id: do-not-panic
    paths:
      - /tmp/too-many-logs/*

output:
  file:
    enabled: true
    codec.json:
      pretty: false
    path: ${path.home}
    filename: "output"
    rotate_on_startup: false
    rotate_every_kb: 1000000

Related issues

## Use cases
## Screenshots
## Logs

@belimawr belimawr added skip-ci Skip the build in the CI but linting Team:Elastic-Agent Label for the Agent team labels Mar 22, 2024
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Mar 22, 2024
Copy link
Contributor

mergify bot commented Mar 22, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @belimawr? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@elasticmachine
Copy link
Collaborator

elasticmachine commented Mar 22, 2024

@elasticmachine
Copy link
Collaborator

elasticmachine commented Mar 22, 2024

⏳ Build in-progress, with failures

Failed CI Steps

History

cc @belimawr

@elasticmachine
Copy link
Collaborator

⏳ Build in-progress, with failures

Failed CI Steps

cc @belimawr

@elasticmachine
Copy link
Collaborator

elasticmachine commented Mar 22, 2024

⏳ Build in-progress, with failures

Failed CI Steps

History

cc @belimawr

@elasticmachine
Copy link
Collaborator

elasticmachine commented Mar 22, 2024

⏳ Build in-progress, with failures

Failed CI Steps

History

cc @belimawr

@elasticmachine
Copy link
Collaborator

elasticmachine commented Mar 22, 2024

⏳ Build in-progress, with failures

Failed CI Steps

History

cc @belimawr

@elasticmachine
Copy link
Collaborator

elasticmachine commented Mar 22, 2024

💔 Build Failed

Failed CI Steps

History

cc @belimawr

@elasticmachine
Copy link
Collaborator

elasticmachine commented Mar 22, 2024

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Duration: 150 min 32 sec

❕ Flaky test report

No test was executed to be analysed.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@elasticmachine
Copy link
Collaborator

elasticmachine commented Mar 22, 2024

💔 Build Failed

Failed CI Steps

History

cc @belimawr

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

History

cc @belimawr

@belimawr belimawr added bug and removed skip-ci Skip the build in the CI but linting labels Mar 22, 2024
@belimawr belimawr requested a review from faec March 22, 2024 16:09
@belimawr belimawr force-pushed the fix-pipeline-panic-v3 branch 2 times, most recently from dc55d3e to f4e1b38 Compare March 27, 2024 14:32
@belimawr belimawr marked this pull request as ready for review March 27, 2024 14:37
@belimawr belimawr requested review from a team as code owners March 27, 2024 14:37
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@belimawr belimawr requested a review from rdner March 27, 2024 14:37
@belimawr belimawr changed the title [WIP] Fix panic when more than 32767 clients are active Fix panic when more than 32767 clients are active Mar 27, 2024
Copy link
Member

@rdner rdner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Linter issues need to be fixed but the change looks good to me.

@belimawr belimawr force-pushed the fix-pipeline-panic-v3 branch 2 times, most recently from 82c201a to 81e6cc7 Compare April 9, 2024 06:59
@belimawr
Copy link
Contributor Author

@elastic/obs-cloud-monitoring and @elastic/sec-deployment-and-devices could you take a look at this PR? At least at the files you own?

@pierrehilbert
Copy link
Collaborator

cc @bturquet & @norrietaylor

Copy link
Contributor

@kaiyan-sheng kaiyan-sheng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awscloudwatch and awss3 input change looks good to me.

@pkoutsovasilis
Copy link
Contributor

@belimawr @pierrehilbert when this one is addressed I am pressing the approve button. Also I would propose to have an entry in CHANGELOG.next.asciidoc to capture the change in ClientConfig

@belimawr
Copy link
Contributor Author

belimawr commented Apr 12, 2024

@belimawr @pierrehilbert when this one is addressed I am pressing the approve button. Also I would propose to have an entry in CHANGELOG.next.asciidoc to capture the change in ClientConfig

CHANGELOG.next.asciidoc is the user-facing changelog, I do not believe internal code changes should go there. However I missed the changelog entry.

I added some changelog entries in 85e42dd

Refactor tests so they do not depend on `client.closeRef`
Remove all references to CloseRef, including the ones that were
commented out.
The pipeline client used in some tests could not have its Close method
called more than once, given the changes introduces in the previous
commits, some test scenarios called it more than once.

This is fixed by using a sync.Once in ChanClient.Close.
Fix number of clients to close, add some comments and better name a
variable.
Copy link
Contributor

mergify bot commented Apr 12, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b fix-pipeline-panic-v3 upstream/fix-pipeline-panic-v3
git merge upstream/main
git push upstream fix-pipeline-panic-v3

@belimawr
Copy link
Contributor Author

Folks, given the PR is pretty much approved, all comments have been addressed and I'm going on PTO, I've enabled auto-merge.

@belimawr belimawr changed the title Fix panic when more than 32767 clients are active Fix panic when more than 32767 pipeline clients are active Apr 12, 2024
@belimawr belimawr merged commit e993e09 into elastic:main Apr 12, 2024
196 of 202 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Team:Elastic-Agent Label for the Agent team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Beats panic if there are more than 32767 pipeline clients
7 participants