[Remote Store] Permit backed futures to prevent timeouts during upload bursts #12159

Merged: 6 commits, May 13, 2024

@vikasvb90 vikasvb90 commented Feb 5, 2024

Description

During a burst of uploads, typically in finalize recovery for operations such as shrink, split, and force merge, where many segments or large segments become available for upload at once, a large number of requests queue up behind the connection pool. This results in one of two timeouts: a failure to acquire a connection, or an idle connection timeout when an ongoing request takes too long to read and compute data for upload because of the high wait time to acquire a thread in the stream reader pool. The problem is more prevalent in the async flow, since the main thread doesn't wait for the response and everything ends up being submitted for upload. Neither the sync nor the async S3 SDK APIs currently provide a way to handle such bursts.

This PR resolves these problems by applying natural backpressure on the main thread with the help of backing permits. It also retries the future on an SDK exception or on a failure to acquire a permit. This means that in a multi-part upload, a failing part can be retried independently.
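To illustrate the idea, here is a minimal sketch of a permit-backed, retryable future. It is not the code in this PR: the class `PermitBackedUploader`, its methods, and its parameters are hypothetical names chosen for the example, and a plain `java.util.concurrent.Semaphore` stands in for the backing permits. The caller blocks in `submit` until a permit is free (backpressure), and a failed attempt or a failed permit acquisition is retried up to a fixed number of attempts.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical sketch: class and method names are illustrative, not taken from this PR.
final class PermitBackedUploader<T> {
    private final Semaphore permits;      // one permit per in-flight request
    private final int maxAttempts;        // total tries per request, including the first
    private final long acquireTimeoutMs;  // how long a caller may wait for a permit

    PermitBackedUploader(int maxInFlight, int maxAttempts, long acquireTimeoutMs) {
        this.permits = new Semaphore(maxInFlight);
        this.maxAttempts = maxAttempts;
        this.acquireTimeoutMs = acquireTimeoutMs;
    }

    int availablePermits() {
        return permits.availablePermits();
    }

    // Blocks the calling thread until a permit is available, then starts the request.
    CompletableFuture<T> submit(Supplier<CompletableFuture<T>> request) {
        CompletableFuture<T> result = new CompletableFuture<>();
        attempt(request, 1, result);
        return result;
    }

    private void attempt(Supplier<CompletableFuture<T>> request, int attemptNo, CompletableFuture<T> result) {
        boolean acquired;
        try {
            acquired = permits.tryAcquire(acquireTimeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            result.completeExceptionally(e);
            return;
        }
        if (!acquired) {
            retryOrFail(request, attemptNo, result, new IllegalStateException("could not acquire permit"));
            return;
        }
        CompletableFuture<T> inFlight;
        try {
            inFlight = request.get();
        } catch (RuntimeException e) {
            inFlight = CompletableFuture.failedFuture(e); // requires Java 9+
        }
        inFlight.whenComplete((value, error) -> {
            permits.release(); // free the permit before any retry re-acquires one
            if (error == null) {
                result.complete(value);
            } else {
                retryOrFail(request, attemptNo, result, error);
            }
        });
    }

    private void retryOrFail(Supplier<CompletableFuture<T>> request, int attemptNo,
                             CompletableFuture<T> result, Throwable cause) {
        if (attemptNo >= maxAttempts) {
            result.completeExceptionally(cause);
        } else {
            attempt(request, attemptNo + 1, result);
        }
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        PermitBackedUploader<String> uploader = new PermitBackedUploader<>(2, 3, 100L);
        // Stand-in for one part of a multi-part upload: fails twice, then succeeds.
        String etag = uploader.submit(() -> calls.incrementAndGet() < 3
            ? CompletableFuture.<String>failedFuture(new RuntimeException("transient"))
            : CompletableFuture.completedFuture("etag-1")).join();
        System.out.println(etag + " after " + calls.get() + " attempts");
    }
}
```

Because each part of a multi-part upload would be submitted as its own request, a failing part is retried without resubmitting the others. In this sketch a retry runs on whichever thread completed the failed attempt, which may block that thread while it waits for a permit. A production version would likely hand retries to a dedicated executor.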

Testing

  1. During post recovery of a 98 GB shard on an r7g.medium box after a split of the nyc_taxis index, I did not observe any IO timeouts. Concurrent execution of the so workload benchmark also did not produce any timeout errors.
  2. No impact on indexing performance was observed when running benchmarks on the main build and on a build with this PR.

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions bot commented Feb 5, 2024

❌ Gradle check result for d22dbe1: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions bot commented May 4, 2024

❌ Gradle check result for 3defb7b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions bot commented May 4, 2024

❌ Gradle check result for 58d730b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

❌ Gradle check result for 8c54c68: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

✅ Gradle check result for c394e90: SUCCESS

@gbbafna gbbafna merged commit c328c18 into opensearch-project:main May 13, 2024
28 checks passed
@gbbafna gbbafna added the backport 2.x Backport to 2.x branch label May 13, 2024
@opensearch-trigger-bot

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-12159-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 c328c18f69c08b21aa8a76af270b04a70c5a8069
# Push it to GitHub
git push --set-upstream origin backport/backport-12159-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-12159-to-2.x.

vikasvb90 added a commit to vikasvb90/OpenSearch that referenced this pull request May 13, 2024
vikasvb90 added a commit to vikasvb90/OpenSearch that referenced this pull request May 14, 2024
gbbafna pushed a commit that referenced this pull request May 14, 2024
@vikasvb90 vikasvb90 mentioned this pull request May 15, 2024
9 tasks
deshsidd pushed a commit to deshsidd/OpenSearch that referenced this pull request May 17, 2024
parv0201 pushed a commit to parv0201/OpenSearch that referenced this pull request Jun 10, 2024
Labels
  • backport 2.x: Backport to 2.x branch
  • backport-failed
  • skip-changelog
  • Storage:Remote
  • Storage:Resiliency: Issues and PRs related to the storage resiliency
  • v2.15.0: Issues and PRs related to version 2.15.0
Projects
Status: ✅ Done