Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Workload Management] QueryGroup Resource Cancellation #15151

Conversation

kiranprakash154
Copy link
Contributor

@kiranprakash154 kiranprakash154 commented Aug 7, 2024

Description

Includes the Resource Cancellation framework changes for Workload management, this is an extension to #13897

Test coverage

image

Related Issues

Resolves - #14883

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@kiranprakash154 kiranprakash154 self-assigned this Aug 7, 2024
@kiranprakash154 kiranprakash154 added the backport 2.x Backport to 2.x branch label Aug 7, 2024
@kiranprakash154 kiranprakash154 changed the title cancellation related [Workload Management] QueryGroup Resource Cancellation framework and implementation Aug 7, 2024
Copy link
Contributor

github-actions bot commented Aug 7, 2024

❌ Gradle check result for 1a27813: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

github-actions bot commented Aug 7, 2024

❌ Gradle check result for 3c77d47: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@kiranprakash154 kiranprakash154 changed the title [Workload Management] QueryGroup Resource Cancellation framework and implementation [Workload Management] QueryGroup Resource Cancellation Aug 7, 2024
Copy link
Contributor

github-actions bot commented Aug 7, 2024

❌ Gradle check result for 3d11693: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

github-actions bot commented Aug 7, 2024

❌ Gradle check result for 8dd0cfa: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❌ Gradle check result for cbb51bd: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❕ Gradle check result for 4e846e2: UNSTABLE

  • TEST FAILURES:
      2 org.opensearch.index.shard.RemoteStoreRefreshListenerTests.classMethod
      1 org.opensearch.index.shard.RemoteStoreRefreshListenerTests.testRefreshSuccessAfterFailureInFirstAttemptAfterSnapshotAndMetadataUpload

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Copy link
Contributor

@kaushalmahi12 kaushalmahi12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!
@kiranprakash154 Can we fix precommit and assemble errors ?

.getResourceUsageData();

for (ResourceType resourceType : TRACKED_RESOURCES) {
if (queryGroup.getResourceLimits().containsKey(resourceType) && queryGroupResourceUsage.containsKey(resourceType)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think right side of this && operator is not required

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its just a check to proceed with isBreachingThreshold only if the querygroup has limits set on resourceType and there is usage recorded for the usage type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is true but it is unnecessary because the queryGroupResourceUsage map contains resource usage for all tracked resources

Signed-off-by: Kiran Prakash <[email protected]>
Signed-off-by: Kiran Prakash <[email protected]>
Signed-off-by: Kiran Prakash <[email protected]>
Signed-off-by: Kiran Prakash <[email protected]>
Signed-off-by: Kiran Prakash <[email protected]>
Signed-off-by: Kiran Prakash <[email protected]>
Signed-off-by: Kiran Prakash <[email protected]>
Copy link
Contributor

❌ Gradle check result for 80dd918: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Kiran Prakash <[email protected]>
Signed-off-by: Kiran Prakash <[email protected]>
Copy link
Contributor

❌ Gradle check result for a469e33: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❌ Gradle check result for 3a5c03a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Comment on lines +246 to +252
// Get the total CPU time of the process in milliseconds
long cpuTotalTimeInMillis = ProcessProbe.getInstance().getProcessCpuTotalTime();
double nodeLevelCancellationThreshold = this.workloadManagementSettings.getNodeLevelCpuCancellationThreshold()
* cpuTotalTimeInMillis;
// Check if resource usage is breaching the threshold
threshold = (long) (resourceThresholdInPercentage * nodeLevelCancellationThreshold);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not correct way, since we are taking process's total CPU time which could be well lets say in hours/days/months magnitude while on the other the hand, tasks may only be running since last couple milliseconds or even minutes in worst case before the request times out.

This will not refelect actual nano/milli seconds worth of a percentage. We are also not considering the number of processors available in this calculation as well

@kaushalmahi12
Copy link
Contributor

Closing this PR in favor of : #15651
Since it already addresses what this PR is trying yo achieve

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch Search:Resiliency
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

4 participants