Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Remote Store] Add capability of doing refresh as determined by the translog #12992

Merged
merged 18 commits into from
Apr 25, 2024

Conversation

astute-decipher
Copy link
Contributor

@astute-decipher astute-decipher commented Apr 1, 2024

Description

After the launch of remote store, We discontinued to allow customer to set Refresh interval = -1, as it can lead to large number of uncommitted Translog files. We want to start allowing it while controlling the number of uncommitted Translog files explicilty. We will introduce a new index setting which will limit the concurrent number of Translog files, once it's breached we will refresh the index and trim unreferenced translog files.

Related Issues

Resolves #12984

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions github-actions bot added enhancement Enhancement or improvement to existing feature or request Storage:Remote labels Apr 1, 2024
Copy link
Contributor

github-actions bot commented Apr 1, 2024

❌ Gradle check result for c9ce216: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

github-actions bot commented Apr 1, 2024

Compatibility status:

Checks if related components are compatible with change efd3e17

Incompatible components

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/flow-framework.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/sql.git]

Copy link
Contributor

@stephen-crawford stephen-crawford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @astute-decipher, thanks for opening this.

Just a couple things:

  1. Can you clarify why you prefer the -1 option? I understand why you removed it but I am not sure why you would want to add it back since it sounds like it is always detrimental by storing extra translog files?

  2. Could you please add unit tests for the individual changes you made and then a few integration tests to verify the translog files are properly wiped with your changes and that the max number setting is respected

Thank you

@astute-decipher
Copy link
Contributor Author

Hi @scrawfor99, In response to your comment :

  1. Can you clarify why you prefer the -1 option? I understand why you removed it but I am not sure why you would want to add it back since it sounds like it is always detrimental by storing extra translog files?

I agree, it is always detrimental to store extra translog files, but it will be helpful in ongoing effort to allow migration from DocRep to Remote-Store #12718, As if someone is in DocRep mode and is using Refresh Interval =-1, will not have to change it explicitly while migrating to remote-store. Also even in today's scenario a very high Refresh Interval (say >10min or a day) can simulate the problem of large number of translog files, and hence these changes can handle it.

  1. Could you please add unit tests for the individual changes you made and then a few integration tests to verify the translog files are properly wiped with your changes and that the max number setting is respected

Sure will be adding them soon.

Copy link
Contributor

github-actions bot commented Apr 3, 2024

❌ Gradle check result for c182f3a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Shubh Sahu added 2 commits April 3, 2024 14:13
…erWriteAction, which trims unreferenced translog readers after every write

Signed-off-by: Shubh Sahu <[email protected]>
Copy link
Member

@ashking94 ashking94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like the overall idea, have left some comments.

@ashking94
Copy link
Member

Lets also add UTs and ITs for the same. Lets also get the PR build to green.

Signed-off-by: Shubh Sahu <[email protected]>
@astute-decipher astute-decipher changed the title Reallow index & cluster default Refresh Interval to be set as -1 [Remote Store] Reallow index & cluster default Refresh Interval to be set as -1 Apr 5, 2024
Copy link
Contributor

github-actions bot commented Apr 5, 2024

❌ Gradle check result for 0e1a2d3: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Shubh Sahu <[email protected]>
Copy link
Contributor

github-actions bot commented Apr 5, 2024

❌ Gradle check result for 4fd9bf5: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

github-actions bot commented Apr 5, 2024

❌ Gradle check result for 90bcd5e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@astute-decipher astute-decipher marked this pull request as ready for review April 7, 2024 15:46
Copy link
Contributor

❌ Gradle check result for 6d93d6a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❕ Gradle check result for efd3e17: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Shubh Sahu added 2 commits April 18, 2024 14:06
Signed-off-by: Shubh Sahu <[email protected]>
Signed-off-by: Shubh Sahu <[email protected]>
Copy link
Contributor

✅ Gradle check result for 6330d16: SUCCESS

CHANGELOG.md Outdated Show resolved Hide resolved
Copy link
Contributor

❌ Gradle check result for 8a81cb7: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

✅ Gradle check result for 503f8ec: SUCCESS

@gbbafna gbbafna added the backport 2.x Backport to 2.x branch label Apr 22, 2024
@astute-decipher astute-decipher self-assigned this Apr 22, 2024
@gbbafna gbbafna merged commit 9e9ab6b into opensearch-project:main Apr 25, 2024
30 of 31 checks passed
@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-12992-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 9e9ab6b3c406be2dcc701934ff03c0e56911819a
# Push it to GitHub
git push --set-upstream origin backport/backport-12992-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-12992-to-2.x.

Copy link
Member

@ashking94 ashking94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@astute-decipher I have left some comments. You might want to address these.

Comment on lines +443 to +445
if (translog.shouldFlush()) {
return true;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see reference of looping in the following java doc couple of lines below this change. Have we verified that there is no potential infinite loop or recursiveness here?


// indexing 100 documents (100 bulk requests), no flush will be triggered yet
for (int i = 0; i < 100; i++) {
indexBulk(INDEX_NAME, 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test will be increasing the test execution time to 90-100s. We can reduce the test execution by setting the translog buffer interval to zero. Can we please do this change?

astute-decipher added a commit to astute-decipher/OpenSearch that referenced this pull request Apr 25, 2024
astute-decipher added a commit to astute-decipher/OpenSearch that referenced this pull request Apr 25, 2024
astute-decipher added a commit to astute-decipher/OpenSearch that referenced this pull request Apr 25, 2024
gbbafna pushed a commit that referenced this pull request Apr 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch backport-failed enhancement Enhancement or improvement to existing feature or request Storage:Remote
Projects
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

[Remote Store] Reallow index & cluster default Refresh Interval to be set as -1.
5 participants