Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add search backpressure cancellation at the coordinator level #5605

Conversation

PritLadani
Copy link
Contributor

@PritLadani PritLadani commented Dec 19, 2022

Description

Cancellation of in-flight SearchTasks based on resource consumption.

This feature aims to identify and cancel resource intensive SearchTasks if they have breached certain
thresholds. This will help in terminating problematic queries which can put nodes in duress and degrade the
cluster performance.

  1. For resource tracking of SearchTasks, we will use the existing thresholds used for SearchShardTask but with different thresholds.
  2. Introduced new dynamic settings for defining thresholds for SearchTask cancellation.
  3. Added search_task stats in the response body of _nodes/stats/search_backpressure.

Issues Resolved

#5173

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: PritLadani [email protected]

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

Signed-off-by: PritLadani <[email protected]>
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

) {
List<TaskResourceUsageTracker> trackers = new ArrayList<>();
trackers.add(new CpuUsageTracker(cpuThresholdSupplier));
if (HEAP_SIZE_BYTES > 0) {
Copy link
Collaborator

@reta reta Feb 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could have this check on HeapUsageTracker as a static method, fe isHeapTrackingSupported. that would clearly communicate what is the purpose:

if (HeapUsageTracker.isHeapTrackingSupported()) {
 ...
}

If you want to push it further and seal the implementation, introduce the factory method to HeapUsageTracker that return Optional<HeapUsageTracker>, for consumers there won't be any doubts if heap tracking is supported or not:

class HeapUsageTracker {
    static Optional<HeapUsageTracker> create() {
     }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Updated in the next commit.

…com:PritLadani/OpenSearch into searchbackpressure/search-task-cancellation
@github-actions
Copy link
Contributor

github-actions bot commented Feb 3, 2023

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.http.SearchRestCancellationIT.testAutomaticCancellationMultiSearchDuringFetchPhase
      1 org.opensearch.http.SearchRestCancellationIT.testAutomaticCancellationDuringFetchPhase

@@ -52,7 +188,96 @@ public long getTotalHeapBytesThreshold() {
return (long) (HEAP_SIZE_BYTES * getTotalHeapPercentThreshold());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The HEAP_SIZE_BYTES leaks here as well, so when it is 0, both thresholds are going to be incorrect: SearchBackpressureService::isHeapUsageDominatedBySearch will always return true. I think we have to move this getTotalHeapBytesThreshold to HeapTracker as well, so the decision regarding heap tracking is made in once place only.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we change the method isHeapUsageDominatedBySearch like this?

boolean isHeapUsageDominatedBySearch(List<CancellableTask> cancellableTasks, double heapPercentThreshold) {
        long usage = cancellableTasks.stream().mapToLong(task -> task.getTotalResourceStats().getMemoryInBytes()).sum();
        long threshold = (long) heapPercentThreshold * getHeapSizeBytes();
        if (isHeapTrackingSupported() && usage < threshold) {
            logger.debug("heap usage not dominated by search requests [{}/{}]", usage, threshold);
            return false;
        }
        return true;
    }

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 This is better, thank you

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I meant to keep it in SearchBackpressureService only. Hope that's fine?
Like, I don't see a problem in having a public getter for HEAP_SIZE_BYTES. Thoughts?

Copy link
Collaborator

@reta reta Feb 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The HEAP_SIZE_BYTES should be sealed: you see, your settings would have incorrect values because the knowledge that HEAP_SIZE_BYTES could be 0 is implicit. This is not good, by sealing it and only having method to access it you free up the users of thinking that HEAP_SIZE_BYTES could be incorrect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not get it. Should we have a public getter or not?

Copy link
Collaborator

@reta reta Feb 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

public getter for getTotalHeapPercentThreshold() - yes, public static long getTotalHeapBytesThreshold(double threshold ) goes to HeapTracker, HEAP_SIZE_BYTES becomes private

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Updated the code.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Setting.Property.NodeScope
);

public static final long HEAP_SIZE_BYTES = JvmStats.jvmStats().getMem().getHeapMax().getBytes();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The HEAP_SIZE_BYTES has to go back to private please

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need the access of HEAP_SIZE_BYTES in Search[Shard]TaskSettings for getHeapBytesThreshold since that should not be moved to HeapUsageTracker. I will access the value using a public method HeapUsageTracker.getHeapSizeBytes. Hope that's fine.

Copy link
Collaborator

@reta reta Feb 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could always calculate the threshold based on getHeapThreshold (this is the only setting you need), and HeapUsageTracker knows how calculate the threshold in bytes (based on HEAP_SIZE_BYTES)

@reta
Copy link
Collaborator

reta commented Feb 3, 2023

I think it is looking good @PritLadani , thank you. Could you please create a complement documentation issue here [1] so the new settings will be properly documented, thank you.

[1] https://github.com/opensearch-project/documentation-website/issues

@reta
Copy link
Collaborator

reta commented Feb 3, 2023

@Bukhtawar mind taking a look please when you have a chance? thank you.

Comment on lines +178 to +193
// Check if increase in heap usage is due to SearchTasks
if (HeapUsageTracker.isHeapUsageDominatedBySearch(
searchTasks,
getSettings().getSearchTaskSettings().getTotalHeapPercentThreshold()
)) {
cancellableTasks.addAll(searchTasks);
}

// Check if increase in heap usage is due to SearchShardTasks
if (HeapUsageTracker.isHeapUsageDominatedBySearch(
searchShardTasks,
getSettings().getSearchShardTaskSettings().getTotalHeapPercentThreshold()
)) {
cancellableTasks.addAll(searchShardTasks);
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

redundant code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have discussed this with @ketanv3 and decided that we will keep this for now. We can decide on removing this check in the later phase.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also why is domination heap based. Can it be CPU dominated as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can be and @ketanv3 is doing some rounds of testing with this. We will address this in the next PR.

Comment on lines +43 to 50
searchShardTaskStats = new SearchShardTaskStats(in);
mode = SearchBackpressureMode.fromName(in.readString());
if (in.getVersion().onOrAfter(Version.V_3_0_0)) {
searchTaskStats = in.readOptionalWriteable(SearchTaskStats::new);
} else {
searchTaskStats = null;
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Set version to V_2_6 and then backport

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Bukhtawar that will fail all main builds :(, including this pull request, we probably should merge -> backport -> and fix main

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish we had a better way to do this, but I agree that @reta's suggestion is the least disruptive (and avoids the bad look of merging a PR with failing tests).

Copy link
Collaborator

@Bukhtawar Bukhtawar Feb 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't the BWC tests fail between the backport and the second main fix. The second problem is the automatic back port starts to fail since it would need a change to point to 2_6. Seems just too much overhead on backports.
Instead if we can auto-trigger a backport PR and keep it handy before we merge to main and merge backport shortly after, it minimises the chances of build failures and simplifies the process.

@reta @andrross Thoughts?

Copy link
Collaborator

@reta reta Feb 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Bukhtawar, yes, once 2.x backport is merged, the main would start to fail but:

  • the time window between main being broken and 2.x being merged is unpredictable, could take hours blocking every other merge to main
  • the pull request to main could be sent right with backport to 2.x - yes, it will fail, but it is just a matter of restarting the Gradle check to get it merged (after 2.x merge)

The process is far from ideal, but the disruption time would be minimal in this case, I think. Also, we should never merge the pull requests with failing check - this is not possible with the main change going first.

CHANGELOG.md Outdated
@@ -9,6 +9,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
- Add getter for path field in NestedQueryBuilder ([#4636](https://github.com/opensearch-project/OpenSearch/pull/4636))
- Allow mmap to use new JDK-19 preview APIs in Apache Lucene 9.4+ ([#5151](https://github.com/opensearch-project/OpenSearch/pull/5151))
- Add support for ppc64le architecture ([#5459](https://github.com/opensearch-project/OpenSearch/pull/5459))
- Cancellation of in-flight SearchTasks based on resource consumption ([#5606](https://github.com/opensearch-project/OpenSearch/pull/5605))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this go into the next 2.x release? If so, please move this to the unreleased 2.x section and make sure the appropriate labels get added to the PR.

Also, it's probably better to link to the issue (#5173), and can you rewrite this in the imperative mood? Maybe something like "Add backpressure cancellation of search tasks at the coordinator level"? But please feel free to rewrite that.

Copy link
Contributor Author

@PritLadani PritLadani Feb 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well yes, this should go in 2.x release, will move it to unreleased 2.x section and update labels.
Sure, will rewrite it as "Add search backpressure cancellation at the coordinator level".
About linking the issue, I have already mentioned the issue id in the PR description. What else should be done here?

Edit: I am not able to add the labels yet. Is it possible for you to add the labels? or @Bukhtawar can you help me add the labels to the PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added the labels. I just mean put the link to the issue instead of the PR in the changelog. It has more relevant information for a user that wants to learn more about this change.

Copy link
Contributor Author

@PritLadani PritLadani Feb 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating the labels.
Agree on relevancy but I do not see any entry with the issue link. Am I missing something or it's fine to add issue link?

@github-actions
Copy link
Contributor

github-actions bot commented Feb 3, 2023

Gradle Check (Jenkins) Run Completed with:

@PritLadani PritLadani changed the title Cancellation of in-flight SearchTasks based on resource consumption Add search backpressure cancellation at the coordinator level Feb 3, 2023
@andrross andrross added backport 2.x Backport to 2.x branch v2.6.0 'Issues and PRs related to version v2.6.0' labels Feb 3, 2023
@github-actions
Copy link
Contributor

github-actions bot commented Feb 3, 2023

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

github-actions bot commented Feb 5, 2023

Gradle Check (Jenkins) Run Completed with:

@Bukhtawar Bukhtawar merged commit 74912d2 into opensearch-project:main Feb 5, 2023
@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-5605-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 74912d252e803c67b18ac74de508d1e87d1bc3c4
# Push it to GitHub
git push --set-upstream origin backport/backport-5605-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-5605-to-2.x.

PritLadani added a commit to PritLadani/OpenSearch that referenced this pull request Feb 6, 2023
…arch-project#5605)

* Cancellation of in-flight search requests at coordinator level

Signed-off-by: PritLadani <[email protected]>
(cherry picked from commit 74912d2)
PritLadani added a commit to PritLadani/OpenSearch that referenced this pull request Feb 6, 2023
…arch-project#5605)

* Cancellation of in-flight search requests at coordinator level

Signed-off-by: PritLadani <[email protected]>
(cherry picked from commit 74912d2)
Signed-off-by: PritLadani <[email protected]>
Bukhtawar pushed a commit that referenced this pull request Feb 7, 2023
…tor level (#5605) (#6194)

* Add search backpressure cancellation at the coordinator level (#5605)

* Cancellation of in-flight search requests at coordinator level

Signed-off-by: PritLadani <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch v2.6.0 'Issues and PRs related to version v2.6.0'
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants