Qsb framework changes #13007

kaushalmahi12 · 2024-04-02T02:28:33Z

This PR is to add skeleton changes to support the search task tracking under sandboxes/queryGroups. This is the very first PR where I have intentionally left out the testing part and want the feedback from the reviewers on the interfaces/classes front (LLD). Once the I get the green flag on the LLD part, We can fill in the implementation details.

Description

[Describe what this change achieves]

Related Issues

Query Sandboxing Feature

Check List

New functionality includes testing.
- All tests pass
New functionality has been documented.
- New functionality has javadoc added
Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
Commits are signed per the DCO using --signoff
Commit changes are listed out in CHANGELOG.md file (See: Changelog)
Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Kaushal Kumar <[email protected]>

github-actions · 2024-04-02T02:32:55Z

Compatibility status:

Checks if related components are compatible with change a4f6c80

Incompatible components

Skipped components

Compatible components

Signed-off-by: Kaushal Kumar <[email protected]>

github-actions · 2024-04-02T02:44:41Z

❌ Gradle check result for a4f6c80: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2024-04-02T02:56:40Z

❌ Gradle check result for 239e5ec: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

andrross · 2024-04-02T16:34:03Z

Thanks @kaushalmahi12. The linked RFC has a lot of discussion, but I don't see consensus on a path forward. The one thing for which it seems like the is consensus is not using the word "sandbox" for this feature.

I would echo @peternied's feedback on the RFC: create a minimal proof-of-concept that shows what this feature will do and how users/admins will interact with it. It's really hard to comment on a skeleton when we don't have agreement on the high level feature.

kaushalmahi12 · 2024-04-02T17:01:24Z

Thanks @kaushalmahi12. The linked RFC has a lot of discussion, but I don't see consensus on a path forward. The one thing for which it seems like the is consensus is not using the word "sandbox" for this feature.

I would echo @peternied's feedback on the RFC: create a minimal proof-of-concept that shows what this feature will do and how users/admins will interact with it. It's really hard to comment on a skeleton when we don't have agreement on the high level feature.

Thanks @andrross for looking at this and providing your useful feedback, If the naming is the only concern we can come to a agreed upon naming convention for this feature and same can be reflected here.

I didn't get it from the discussion on the RFC that majority of the folks were aligned on creating a POC first for this. If that is something to move forward with, we can definitely work from that point as well.

andrross · 2024-04-02T18:41:06Z

My takeaway from the RFC discussion is that there is no clear consensus around this feature. The POC suggestion is just one approach to help build that consensus. It's certainly not the only way to go though. The main point here is that without clear consensus toward a goal then you're not likely to get much useful feedback on a skeleton PR like this.

kaushalmahi12 · 2024-04-02T21:38:58Z

Got it!, @andrross
What is the minimal expectation from POC ? I think our main concern here is, how would it work from user's point of view. I thought We had covered that in the community meeting and the google doc here

Regarding consensus aspect, I would really appreciate your help if you could help me with actionable feedback on how can we should go about it ? Although I have replied to the last few comments here.

Can you also provide your feedback on the RFC ?

msfroh · 2024-04-02T22:45:57Z

? I think your main concern was, how would it work from user's point of view. I thought We had covered that in the community meeting and the google doc here

I remember from the community meeting that there were a lot of concerns about the approach, and especially the large number of new entities, concepts, abstractions, and settings being created.

For example, this PR adds a new service, a new task interface, and new search logic injected into core.

By comparison, I think associating a search request with a given "sandbox" (really, I think "tagging" or "labeling" queries is a better abstraction) could be done with no code change in core with a SearchRequestProcessor injected via a plugin.

More broadly, I think we should settle on one abstraction that we use to associate requests with tenants in multi-tenant system. I still like the idea of "named entrypoints" (what @peternied called a "view"). I imagine them as different entrances to the OpenSearch amusement park, where using a given entrance requires some particular credentials, and each entrance gives you a different-colored wristband that determines what indices you query, what kind of queries you are allowed to run, and whether you'll get kicked out of the park if it gets too busy. (We can also track the top N queries for each color of wristband.)

kaushalmahi12 · 2024-04-03T17:54:43Z

Thanks @msfroh ! for taking the time to go through this. We are also assigning a label to these requests and passing this on to corresponding tasks in the flow. I like the idea of tagging the requests through a Plugin.

Regarding the new service for tracking the task resource usage I think we have to have a periodically running thread to calculate the active resource usage of sandboxes/queryGroups/someBetterName.

Do you have any additional thoughts on overcoming this ?

stephen-crawford

Left some questions and comments. Overall looks like a strong framework. We will definitely want to add some tests before looking to actually merge any of this.

stephen-crawford · 2024-04-04T15:06:54Z

server/src/main/java/org/opensearch/action/search/SearchShardTask.java

@@ -50,9 +51,10 @@
 * @opensearch.api
 */
 @PublicApi(since = "1.0.0")
-public class SearchShardTask extends CancellableTask implements SearchBackpressureTask {
+public class SearchShardTask extends CancellableTask implements SearchBackpressureTask, SandboxableTask {


By making the SanboxableTask an interface like this it would imply there are numerous tasks which may eventually support sandboxing (or whatever it ends up being called); do you have some other ideas around what this may look like?

I think that may help with defining the interface more accurately. Personally, I had thought it was only for search so I could have been misunderstanding things.

We were thinking that this feature can supersede the AdmissionControl and SearchBackpressure, We can extend this idea to indexing traffic as well.

stephen-crawford · 2024-04-04T16:57:19Z

server/src/main/java/org/opensearch/search/sandbox/RequestSandboxClassifier.java

+     * @param request is a coordinator search request
+     * @return matching sandboxId based on request attributes
+     */
+    public String resolveSandboxFor(final SearchRequest request) {


I know this is just a draft, but I am not sure how this will work at the moment. Can you clarify your plans for managing this?

I would likely expect the search request to be able to identify its own sandbox if it is extending the sandboxabletask interface. Otherwise I think we would not need to extend it?

This will likely get merged into search pipelines now, Given that we are doing nothing but tagging these requests.

To briefly describe this, We will do the tagging (one time process) of incoming search requests at co-ordinator node and propagate this tagged information to shard level requests as well, this will help us group the requests into these different buckets/sandboxes

stephen-crawford · 2024-04-04T16:57:51Z

server/src/main/java/org/opensearch/search/sandbox/SandboxStatsHolder.java

+/**
+ * Main class to hold sandbox level stats
+ */
+public class SandboxStatsHolder {


Can you provide a couple examples of the stats you expect to track here?

We will track the following per sandbox

runningTaskCount

cancelledTaskCount

rejectedTaskCount

currentResourceUsage

completedTaskCount

server/src/main/java/org/opensearch/search/sandbox/SandboxedRequestTrackerService.java

stephen-crawford · 2024-04-04T17:07:10Z

server/src/main/java/org/opensearch/search/sandbox/SandboxedRequestTrackerService.java

+        return null;
+    }
+
+    private AbsoluteResourceUsage calculateAbsoluteResourceUsageFor(Task task) {


Are we always able to access this information on all distributions?

Yes, This will give us approximately correct resource usage for a task.

peternied

@kaushalmahi12 I appreciate the effort to put forth more changes towards this space thanks for the time you've put into this change.

Three maintainers, myself, @msfroh and @andrross are voicing concern about this feature. There has been considerable dialog in the RFC calling out how to move forward. I'm not sure how this pull request is related to those questions - as the concerns are not implementation specific, but user experience and overall design related.

I do not see how the feature can move forward with smaller RFCs or POC focused around the user experience over all design. Here is what is in my mind as the most burning questions.

How does a sysadmin know that they should use this feature?
- Why aren't sysadmins creating separate OpenSearch clusters?
- Why isn't admission control enough?
What is the impact on OpenSearch clients this feature is enabled?
- Return errors such as 429s or delay responses in a queue?
- Are there additional query parameters that are required?
How do sys admins know the feature is working as expected?
- How is routing/filtering verifiable?
- Are existing health metrics still viable, does health need to be considered different for different sandboxes?

These each sound like 3 or more RFCs that an be driven to consensus with the project's maintainers. Please work together with the community towards alignment and shared understanding.

kaushalmahi12 · 2024-04-18T19:42:21Z

How does a sysadmin know that they should use this feature?

We will need to educate the users through blogs, Usecase: If the admins are facing frequent regressions in the cluster due to a group of misbehaving queries, then this is a suitable construct to take actions against such search requests without compromising the stability of the cluster. Although this is very basic use case but this can evolve into something along the workload management in Opensearch.
I think we have covered the User experience in detail in the googleDoc

Why aren't sysadmins creating separate OpenSearch clusters?

Can you elaborate more on this ?

Why isn't admission control enough?

Admission Control works is the gatekeeping mechanism and doesn't control the lifecycle of running tasks.

What is the impact on OpenSearch clients this feature is enabled?

Based on the sandbox definitions, Users are expected to see 429s under load.

Are there additional query parameters that are required?

No.

How do sys admins know the feature is working as expected?

We will expose the stats API for this to see various metrices around sandbox, e.g; running tasks, completedTasks etc;

Are existing health metrics still viable, does health need to be considered different for different sandboxes?

Since the resource constraints for sandboxes might be different, Hence the health metrices for the sandboxes depends on the current resource usage for the sandbox and defined limits.

kaushalmahi12 · 2024-04-18T22:06:58Z

I am closing this PR as We are going to move few things into plugin based on my discussion with @msfroh.
I will update the RFC with the details around this.

kaushalmahi12 added 3 commits April 1, 2024 17:22

add QSB framework related changes

270d409

Signed-off-by: Kaushal Kumar <[email protected]>

resolve merge conflicts

f94504d

Signed-off-by: Kaushal Kumar <[email protected]>

add validity checks for settings

239e5ec

Signed-off-by: Kaushal Kumar <[email protected]>

Merge branch 'main' into qsb-framework-changes

a4f6c80

Signed-off-by: Kaushal Kumar <[email protected]>

stephen-crawford reviewed Apr 4, 2024

View reviewed changes

peternied requested changes Apr 9, 2024

View reviewed changes

kaushalmahi12 closed this Apr 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Qsb framework changes #13007

Qsb framework changes #13007

kaushalmahi12 commented Apr 2, 2024

github-actions bot commented Apr 2, 2024 •

edited

Loading

github-actions bot commented Apr 2, 2024

github-actions bot commented Apr 2, 2024

andrross commented Apr 2, 2024

kaushalmahi12 commented Apr 2, 2024 •

edited

Loading

andrross commented Apr 2, 2024

kaushalmahi12 commented Apr 2, 2024 •

edited

Loading

msfroh commented Apr 2, 2024 •

edited

Loading

kaushalmahi12 commented Apr 3, 2024

stephen-crawford left a comment

stephen-crawford Apr 4, 2024

kaushalmahi12 Apr 8, 2024

stephen-crawford Apr 4, 2024

kaushalmahi12 Apr 8, 2024

stephen-crawford Apr 4, 2024

kaushalmahi12 Apr 8, 2024

stephen-crawford Apr 4, 2024

kaushalmahi12 Apr 8, 2024

peternied left a comment

kaushalmahi12 commented Apr 18, 2024 •

edited

Loading

kaushalmahi12 commented Apr 18, 2024 •

edited by msfroh

Loading

Qsb framework changes #13007

Qsb framework changes #13007

Conversation

kaushalmahi12 commented Apr 2, 2024

Description

Related Issues

Check List

github-actions bot commented Apr 2, 2024 • edited Loading

Compatibility status:

Incompatible components

Skipped components

Compatible components

github-actions bot commented Apr 2, 2024

github-actions bot commented Apr 2, 2024

andrross commented Apr 2, 2024

kaushalmahi12 commented Apr 2, 2024 • edited Loading

andrross commented Apr 2, 2024

kaushalmahi12 commented Apr 2, 2024 • edited Loading

msfroh commented Apr 2, 2024 • edited Loading

kaushalmahi12 commented Apr 3, 2024

stephen-crawford left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

peternied left a comment

Choose a reason for hiding this comment

kaushalmahi12 commented Apr 18, 2024 • edited Loading

kaushalmahi12 commented Apr 18, 2024 • edited by msfroh Loading

github-actions bot commented Apr 2, 2024 •

edited

Loading

kaushalmahi12 commented Apr 2, 2024 •

edited

Loading

kaushalmahi12 commented Apr 2, 2024 •

edited

Loading

msfroh commented Apr 2, 2024 •

edited

Loading

kaushalmahi12 commented Apr 18, 2024 •

edited

Loading

kaushalmahi12 commented Apr 18, 2024 •

edited by msfroh

Loading