-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Qsb framework changes #13007
Qsb framework changes #13007
Conversation
Signed-off-by: Kaushal Kumar <[email protected]>
Signed-off-by: Kaushal Kumar <[email protected]>
Signed-off-by: Kaushal Kumar <[email protected]>
Compatibility status:Checks if related components are compatible with change a4f6c80 Incompatible componentsSkipped componentsCompatible components |
Signed-off-by: Kaushal Kumar <[email protected]>
❌ Gradle check result for a4f6c80: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for 239e5ec: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Thanks @kaushalmahi12. The linked RFC has a lot of discussion, but I don't see consensus on a path forward. The one thing for which it seems like the is consensus is not using the word "sandbox" for this feature. I would echo @peternied's feedback on the RFC: create a minimal proof-of-concept that shows what this feature will do and how users/admins will interact with it. It's really hard to comment on a skeleton when we don't have agreement on the high level feature. |
Thanks @andrross for looking at this and providing your useful feedback, If the naming is the only concern we can come to a agreed upon naming convention for this feature and same can be reflected here. I didn't get it from the discussion on the RFC that majority of the folks were aligned on creating a POC first for this. If that is something to move forward with, we can definitely work from that point as well. |
My takeaway from the RFC discussion is that there is no clear consensus around this feature. The POC suggestion is just one approach to help build that consensus. It's certainly not the only way to go though. The main point here is that without clear consensus toward a goal then you're not likely to get much useful feedback on a skeleton PR like this. |
Got it!, @andrross Regarding consensus aspect, I would really appreciate your help if you could help me with actionable feedback on how can we should go about it ? Although I have replied to the last few comments here. Can you also provide your feedback on the RFC ? |
I remember from the community meeting that there were a lot of concerns about the approach, and especially the large number of new entities, concepts, abstractions, and settings being created. For example, this PR adds a new service, a new task interface, and new search logic injected into core. By comparison, I think associating a search request with a given "sandbox" (really, I think "tagging" or "labeling" queries is a better abstraction) could be done with no code change in core with a More broadly, I think we should settle on one abstraction that we use to associate requests with tenants in multi-tenant system. I still like the idea of "named entrypoints" (what @peternied called a "view"). I imagine them as different entrances to the OpenSearch amusement park, where using a given entrance requires some particular credentials, and each entrance gives you a different-colored wristband that determines what indices you query, what kind of queries you are allowed to run, and whether you'll get kicked out of the park if it gets too busy. (We can also track the top N queries for each color of wristband.) |
Thanks @msfroh ! for taking the time to go through this. We are also assigning a label to these requests and passing this on to corresponding tasks in the flow. I like the idea of tagging the requests through a Plugin. Regarding the new service for tracking the task resource usage I think we have to have a periodically running thread to calculate the active resource usage of sandboxes/queryGroups/someBetterName. Do you have any additional thoughts on overcoming this ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some questions and comments. Overall looks like a strong framework. We will definitely want to add some tests before looking to actually merge any of this.
@@ -50,9 +51,10 @@ | |||
* @opensearch.api | |||
*/ | |||
@PublicApi(since = "1.0.0") | |||
public class SearchShardTask extends CancellableTask implements SearchBackpressureTask { | |||
public class SearchShardTask extends CancellableTask implements SearchBackpressureTask, SandboxableTask { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By making the SanboxableTask an interface like this it would imply there are numerous tasks which may eventually support sandboxing (or whatever it ends up being called); do you have some other ideas around what this may look like?
I think that may help with defining the interface more accurately. Personally, I had thought it was only for search so I could have been misunderstanding things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We were thinking that this feature can supersede the AdmissionControl and SearchBackpressure, We can extend this idea to indexing traffic as well.
* @param request is a coordinator search request | ||
* @return matching sandboxId based on request attributes | ||
*/ | ||
public String resolveSandboxFor(final SearchRequest request) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is just a draft, but I am not sure how this will work at the moment. Can you clarify your plans for managing this?
I would likely expect the search request to be able to identify its own sandbox if it is extending the sandboxabletask interface. Otherwise I think we would not need to extend it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will likely get merged into search pipelines now, Given that we are doing nothing but tagging these requests.
To briefly describe this, We will do the tagging (one time process) of incoming search requests at co-ordinator node and propagate this tagged information to shard level requests as well, this will help us group the requests into these different buckets/sandboxes
/** | ||
* Main class to hold sandbox level stats | ||
*/ | ||
public class SandboxStatsHolder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you provide a couple examples of the stats you expect to track here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will track the following per sandbox
- runningTaskCount
- cancelledTaskCount
- rejectedTaskCount
- currentResourceUsage
- completedTaskCount
server/src/main/java/org/opensearch/search/sandbox/SandboxedRequestTrackerService.java
Show resolved
Hide resolved
return null; | ||
} | ||
|
||
private AbsoluteResourceUsage calculateAbsoluteResourceUsageFor(Task task) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we always able to access this information on all distributions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, This will give us approximately correct resource usage for a task.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kaushalmahi12 I appreciate the effort to put forth more changes towards this space thanks for the time you've put into this change.
Three maintainers, myself, @msfroh and @andrross are voicing concern about this feature. There has been considerable dialog in the RFC calling out how to move forward. I'm not sure how this pull request is related to those questions - as the concerns are not implementation specific, but user experience and overall design related.
I do not see how the feature can move forward with smaller RFCs or POC focused around the user experience over all design. Here is what is in my mind as the most burning questions.
- How does a sysadmin know that they should use this feature?
- Why aren't sysadmins creating separate OpenSearch clusters?
- Why isn't admission control enough?
- What is the impact on OpenSearch clients this feature is enabled?
- Return errors such as 429s or delay responses in a queue?
- Are there additional query parameters that are required?
- How do sys admins know the feature is working as expected?
- How is routing/filtering verifiable?
- Are existing health metrics still viable, does health need to be considered different for different sandboxes?
These each sound like 3 or more RFCs that an be driven to consensus with the project's maintainers. Please work together with the community towards alignment and shared understanding.
We will need to educate the users through blogs, Usecase: If the admins are facing frequent regressions in the cluster due to a group of misbehaving queries, then this is a suitable construct to take actions against such search requests without compromising the stability of the cluster. Although this is very basic use case but this can evolve into something along the workload management in Opensearch.
Can you elaborate more on this ?
Admission Control works is the gatekeeping mechanism and doesn't control the lifecycle of running tasks.
Based on the sandbox definitions, Users are expected to see 429s under load.
No.
We will expose the stats API for this to see various metrices around sandbox, e.g; running tasks, completedTasks etc;
Since the resource constraints for sandboxes might be different, Hence the health metrices for the sandboxes depends on the current resource usage for the sandbox and defined limits. |
I am closing this PR as We are going to move few things into plugin based on my discussion with @msfroh. |
This PR is to add skeleton changes to support the search task tracking under sandboxes/queryGroups. This is the very first PR where I have intentionally left out the testing part and want the feedback from the reviewers on the interfaces/classes front (LLD). Once the I get the green flag on the LLD part, We can fill in the implementation details.
Description
[Describe what this change achieves]
Related Issues
Query Sandboxing Feature
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.