Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Qsb framework changes #13007

Conversation

kaushalmahi12
Copy link
Contributor

This PR is to add skeleton changes to support the search task tracking under sandboxes/queryGroups. This is the very first PR where I have intentionally left out the testing part and want the feedback from the reviewers on the interfaces/classes front (LLD). Once the I get the green flag on the LLD part, We can fill in the implementation details.

Description

[Describe what this change achieves]

Related Issues

Query Sandboxing Feature

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Copy link
Contributor

github-actions bot commented Apr 2, 2024

Compatibility status:

Checks if related components are compatible with change a4f6c80

Incompatible components

Skipped components

Compatible components

Copy link
Contributor

github-actions bot commented Apr 2, 2024

❌ Gradle check result for a4f6c80: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

github-actions bot commented Apr 2, 2024

❌ Gradle check result for 239e5ec: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@andrross
Copy link
Member

andrross commented Apr 2, 2024

Thanks @kaushalmahi12. The linked RFC has a lot of discussion, but I don't see consensus on a path forward. The one thing for which it seems like the is consensus is not using the word "sandbox" for this feature.

I would echo @peternied's feedback on the RFC: create a minimal proof-of-concept that shows what this feature will do and how users/admins will interact with it. It's really hard to comment on a skeleton when we don't have agreement on the high level feature.

@kaushalmahi12
Copy link
Contributor Author

kaushalmahi12 commented Apr 2, 2024

Thanks @kaushalmahi12. The linked RFC has a lot of discussion, but I don't see consensus on a path forward. The one thing for which it seems like the is consensus is not using the word "sandbox" for this feature.

I would echo @peternied's feedback on the RFC: create a minimal proof-of-concept that shows what this feature will do and how users/admins will interact with it. It's really hard to comment on a skeleton when we don't have agreement on the high level feature.

Thanks @andrross for looking at this and providing your useful feedback, If the naming is the only concern we can come to a agreed upon naming convention for this feature and same can be reflected here.

I didn't get it from the discussion on the RFC that majority of the folks were aligned on creating a POC first for this. If that is something to move forward with, we can definitely work from that point as well.

@andrross
Copy link
Member

andrross commented Apr 2, 2024

My takeaway from the RFC discussion is that there is no clear consensus around this feature. The POC suggestion is just one approach to help build that consensus. It's certainly not the only way to go though. The main point here is that without clear consensus toward a goal then you're not likely to get much useful feedback on a skeleton PR like this.

@kaushalmahi12
Copy link
Contributor Author

kaushalmahi12 commented Apr 2, 2024

Got it!, @andrross
What is the minimal expectation from POC ? I think our main concern here is, how would it work from user's point of view. I thought We had covered that in the community meeting and the google doc here

Regarding consensus aspect, I would really appreciate your help if you could help me with actionable feedback on how can we should go about it ? Although I have replied to the last few comments here.

Can you also provide your feedback on the RFC ?

@msfroh
Copy link
Collaborator

msfroh commented Apr 2, 2024

? I think your main concern was, how would it work from user's point of view. I thought We had covered that in the community meeting and the google doc here

I remember from the community meeting that there were a lot of concerns about the approach, and especially the large number of new entities, concepts, abstractions, and settings being created.

For example, this PR adds a new service, a new task interface, and new search logic injected into core.

By comparison, I think associating a search request with a given "sandbox" (really, I think "tagging" or "labeling" queries is a better abstraction) could be done with no code change in core with a SearchRequestProcessor injected via a plugin.

More broadly, I think we should settle on one abstraction that we use to associate requests with tenants in multi-tenant system. I still like the idea of "named entrypoints" (what @peternied called a "view"). I imagine them as different entrances to the OpenSearch amusement park, where using a given entrance requires some particular credentials, and each entrance gives you a different-colored wristband that determines what indices you query, what kind of queries you are allowed to run, and whether you'll get kicked out of the park if it gets too busy. (We can also track the top N queries for each color of wristband.)

@kaushalmahi12
Copy link
Contributor Author

Thanks @msfroh ! for taking the time to go through this. We are also assigning a label to these requests and passing this on to corresponding tasks in the flow. I like the idea of tagging the requests through a Plugin.

Regarding the new service for tracking the task resource usage I think we have to have a periodically running thread to calculate the active resource usage of sandboxes/queryGroups/someBetterName.

Do you have any additional thoughts on overcoming this ?

Copy link
Contributor

@stephen-crawford stephen-crawford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some questions and comments. Overall looks like a strong framework. We will definitely want to add some tests before looking to actually merge any of this.

@@ -50,9 +51,10 @@
* @opensearch.api
*/
@PublicApi(since = "1.0.0")
public class SearchShardTask extends CancellableTask implements SearchBackpressureTask {
public class SearchShardTask extends CancellableTask implements SearchBackpressureTask, SandboxableTask {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By making the SanboxableTask an interface like this it would imply there are numerous tasks which may eventually support sandboxing (or whatever it ends up being called); do you have some other ideas around what this may look like?

I think that may help with defining the interface more accurately. Personally, I had thought it was only for search so I could have been misunderstanding things.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We were thinking that this feature can supersede the AdmissionControl and SearchBackpressure, We can extend this idea to indexing traffic as well.

* @param request is a coordinator search request
* @return matching sandboxId based on request attributes
*/
public String resolveSandboxFor(final SearchRequest request) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is just a draft, but I am not sure how this will work at the moment. Can you clarify your plans for managing this?

I would likely expect the search request to be able to identify its own sandbox if it is extending the sandboxabletask interface. Otherwise I think we would not need to extend it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will likely get merged into search pipelines now, Given that we are doing nothing but tagging these requests.

To briefly describe this, We will do the tagging (one time process) of incoming search requests at co-ordinator node and propagate this tagged information to shard level requests as well, this will help us group the requests into these different buckets/sandboxes

/**
* Main class to hold sandbox level stats
*/
public class SandboxStatsHolder {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you provide a couple examples of the stats you expect to track here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will track the following per sandbox

  • runningTaskCount
  • cancelledTaskCount
  • rejectedTaskCount
  • currentResourceUsage
  • completedTaskCount

return null;
}

private AbsoluteResourceUsage calculateAbsoluteResourceUsageFor(Task task) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we always able to access this information on all distributions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, This will give us approximately correct resource usage for a task.

Copy link
Member

@peternied peternied left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kaushalmahi12 I appreciate the effort to put forth more changes towards this space thanks for the time you've put into this change.

Three maintainers, myself, @msfroh and @andrross are voicing concern about this feature. There has been considerable dialog in the RFC calling out how to move forward. I'm not sure how this pull request is related to those questions - as the concerns are not implementation specific, but user experience and overall design related.

I do not see how the feature can move forward with smaller RFCs or POC focused around the user experience over all design. Here is what is in my mind as the most burning questions.

  • How does a sysadmin know that they should use this feature?
    • Why aren't sysadmins creating separate OpenSearch clusters?
    • Why isn't admission control enough?
  • What is the impact on OpenSearch clients this feature is enabled?
    • Return errors such as 429s or delay responses in a queue?
    • Are there additional query parameters that are required?
  • How do sys admins know the feature is working as expected?
    • How is routing/filtering verifiable?
    • Are existing health metrics still viable, does health need to be considered different for different sandboxes?

These each sound like 3 or more RFCs that an be driven to consensus with the project's maintainers. Please work together with the community towards alignment and shared understanding.

@kaushalmahi12
Copy link
Contributor Author

kaushalmahi12 commented Apr 18, 2024

How does a sysadmin know that they should use this feature?

We will need to educate the users through blogs, Usecase: If the admins are facing frequent regressions in the cluster due to a group of misbehaving queries, then this is a suitable construct to take actions against such search requests without compromising the stability of the cluster. Although this is very basic use case but this can evolve into something along the workload management in Opensearch.
I think we have covered the User experience in detail in the googleDoc

Why aren't sysadmins creating separate OpenSearch clusters?

Can you elaborate more on this ?

Why isn't admission control enough?

Admission Control works is the gatekeeping mechanism and doesn't control the lifecycle of running tasks.

What is the impact on OpenSearch clients this feature is enabled?

Based on the sandbox definitions, Users are expected to see 429s under load.

Are there additional query parameters that are required?

No.

How do sys admins know the feature is working as expected?

We will expose the stats API for this to see various metrices around sandbox, e.g; running tasks, completedTasks etc;

Are existing health metrics still viable, does health need to be considered different for different sandboxes?

Since the resource constraints for sandboxes might be different, Hence the health metrices for the sandboxes depends on the current resource usage for the sandbox and defined limits.

@kaushalmahi12
Copy link
Contributor Author

kaushalmahi12 commented Apr 18, 2024

I am closing this PR as We are going to move few things into plugin based on my discussion with @msfroh.
I will update the RFC with the details around this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants