[META] QueryGroup level stats structure and api #15120

Closed
kaushalmahi12 opened this issue Aug 5, 2024 · 8 comments · Fixed by #15527
Labels: Meta (Meta issue, not directly linked to a PR), Search (Search query, autocomplete ...etc)

Comments

kaushalmahi12 (Contributor) commented Aug 5, 2024

Please describe the end goal of this project

We want to come to a conclusion on all the fields that should be present in query group stats, since the _nodes/stats output is already quite large and we will have [0-100] query groups available in the cluster.

Though the number of query groups is limited, each query group accumulates node-level metrics, so the actual number of such stat objects in the cluster will be #dataNodes * #queryGroups.

If the cluster is large, e.g. dataNodes = 200 and queryGroups = 50, there will be 10,000 such objects, which is a lot and can make the stats output hard to digest.

The schema for a single query group stat that I am proposing is something like the following:

{
    "queryGroupId": {
        "completions": Long,
        "rejections": Long,
        "CPU": { "currentUsage": Double, "cancellations": Long },
        "Memory": { "currentUsage": Double, "cancellations": Long }
    }
}

Apart from the resource usage fields, all the metric values here are cumulative counters since process start time.
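
For illustration, a filled-in instance of this schema for one query group might look like the following (the id and values are illustrative):

{
    "P7D7VE4fe7": {
        "completions": 4021,
        "rejections": 12,
        "CPU": { "currentUsage": 0.35, "cancellations": 3 },
        "Memory": { "currentUsage": 0.18, "cancellations": 7 }
    }
}

Here completions, rejections and the cancellations counters only ever grow from process start, while currentUsage is a point-in-time value.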

Keeping these things in mind, I am more inclined towards keeping the stats API for this separate.
I think feature-related stats should only be provided when explicitly asked for, either using:

  • a queryParam
  • a pathParam

(both styles are sketched below). Currently, if a feature is enabled and has stats, they are returned by default. If a client is consuming _nodes/stats and then upgrades to an OpenSearch version which has an additional stats object in the response, it can break the client code.
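
To make the two opt-in styles above concrete (the parameter and path names here are purely illustrative, not an actual API):

queryParam: GET _nodes/stats?query_group=true
pathParam:  GET _nodes/stats/query_group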

Supporting References

#12342


Related component

Search:Resiliency

kaushalmahi12 added the Meta and untriaged labels on Aug 5, 2024
jainankitk added the Search label on Aug 7, 2024
jainankitk (Collaborator) commented:

Thanks @kaushalmahi12 for initiating this issue. A few considerations:

  1. Should we also have failures in the queryGroup stats, since we don't track usage for those requests?
  2. Also, should we distinguish between the stats for SearchTask and SearchShardTask?
  3. I am slightly unclear on whether you're suggesting the stats be part of node stats and only returned using queryParam/pathParam, or that these stats be a separate stats API.

kaushalmahi12 (Contributor, Author) commented:

@jainankitk Thanks for reviewing it.

I am slightly unclear on whether you're suggesting the stats be part of node stats and only returned using queryParam/pathParam, or that these stats be a separate stats API.

Yes, I am suggesting to have a separate API.

Should we also have failures in the queryGroup stats, since we don't track usage for those requests?

What do we mean by failures here?

Also, should we distinguish between the stats for SearchTask and SearchShardTask?

IMO this will add unnecessary and redundant fields to the API. Why do we think these will be useful w.r.t. a QueryGroup?

@backslasht Can you provide your inputs on this?

kaushalmahi12 (Contributor, Author) commented Aug 15, 2024

@jainankitk If you mean request/task failures, then I have the following reasons not to include them in the stats.

  • Rejections and cancellations happen because of query group resource usage, while failures are transient or unrelated to the query group limits themselves, so they should not come under the queryGroup stats.
  • The failure hooks are not built into the task resource tracking framework. Adding them just for the sake of failure stats would be overkill given the value add.

backslasht (Contributor) commented:

Thanks @kaushalmahi12 for creating this issue.

+1 for creating a new API instead of adding it to node stats. I'm interested in knowing what filters you are thinking of providing as part of this new API. Request/response structures will be helpful to understand it.

Rejections and cancellations happen because of query group resource usage, while failures are transient or unrelated to the query group limits themselves, so they should not come under the queryGroup stats.

While I agree that failures may not be due to resource consumption, it does provide clarity for the user to understand which query groups are seeing failures, especially if those query groups belong to multiple tenants.

kaushalmahi12 (Contributor, Author) commented Aug 16, 2024

Thanks @backslasht for going through this and providing your useful insights.

I'm interested in knowing what filters you are thinking of providing as part of this new API. Request/response structures will be helpful to understand it.

Regarding this, we can keep the API aligned with the _nodes/stats API and provide path-param based filtering, e.g.:

GET query_group/stats
GET query_group/<node_id>/stats
GET query_group/stats/<metrics>

Here <metrics> would be a comma-separated list of the fields that we want as part of the response.
Some concrete examples could be as follows:

GET query_group/stats

Response

{
    "F-ByTQzVQ3GQeYzQJArJGQ": {
        "P7D7VE4fe7": {
            "completions": 12343134,
            "rejections": 343,
            "CPU": {
                "current_usage": 0.4,
                "cancellations": 10
            },
            "MEMORY": {
                "current_usage": 0.2,
                "cancellations": 101
            }
        }
        .....
    }

    .....
}
GET query_group/F-ByTQzVQ3GQeYzQJArJGQ/stats

Response

{
    "F-ByTQzVQ3GQeYzQJArJGQ": {
        "P7D7VE4fe7": {
            "completions": 12343134,
            "rejections": 343,
            "CPU": {
                "current_usage": 0.4,
                "cancellations": 10
            },
            "MEMORY": {
                "current_usage": 0.2,
                "cancellations": 101
            }
        }
        .....
    }
}
GET query_group/stats/completions,rejections

Response

{
    "F-ByTQzVQ3GQeYzQJArJGQ": {
        "P7D7VE4fe7": {
            "completions": 12343134,
            "rejections": 343
        },
        "RPqvjXFyiP": {
            "completions": 98424211,
            "rejections": 352
        }
        .....
    }
    ....
}

it does provide clarity for the user to understand which query groups are seeing failures, especially if those query groups belong to multiple tenants.

I think failures from a multi-tenancy perspective do make sense to incorporate. And since we are extending this idea to do task scheduling using a threadpool per query group, in the future it might also make sense to expose the currently running tasks and queue size for the tenant.

Keeping these things in mind, I think it shouldn't hurt to add failures to the stats.
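
As a rough sketch, a failures counter could then slot into the per-query-group structure from the examples above (the field name and value are illustrative):

{
    "F-ByTQzVQ3GQeYzQJArJGQ": {
        "P7D7VE4fe7": {
            "completions": 12343134,
            "rejections": 343,
            "failures": 27,
            "CPU": {
                "current_usage": 0.4,
                "cancellations": 10
            },
            "MEMORY": {
                "current_usage": 0.2,
                "cancellations": 101
            }
        }
    }
}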

jainankitk (Collaborator) commented:

@kaushalmahi12 - Thanks for providing the request/response structures. A few comments:

  • Might be good to add whether a QueryGroup is breaching its threshold or not. Not sure if we should add a per-resource breach indicator as well.
  • Do we need a cumulative cancellations metric in addition to the per-resource ones? I am assuming cumulative rejections is for tracking requests for which we did not even start execution.
  • Maybe for simplicity, we need not support query_group/stats/completions,rejections. Although, we should take some filtering parameters, like breach=true, for looking at just the QueryGroups breaching one of the resource thresholds. We could extend that by taking breach=memory to get the QueryGroups breaching the memory threshold (see the example requests after this list).
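
For example, breach-based filtering requests could look like the following (the breach parameter is only a suggestion at this point):

GET query_group/stats?breach=true
GET query_group/stats?breach=memory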

I think failures from a multi-tenancy perspective do make sense to incorporate. And since we are extending this idea to do task scheduling using a threadpool per query group, in the future it might also make sense to expose the currently running tasks and queue size for the tenant.

I also agree with including failures as one of the metrics, and with adding running tasks/queue size and other relevant metrics once we introduce QueryScheduling using QueryGroups.

backslasht (Contributor) commented:

Thanks @kaushalmahi12 for the request and response structures. It will be good to add filters using query group id/name as well.
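
One hypothetical way to express such a filter would be as query parameters on the proposed endpoint (the parameter names are illustrative):

GET query_group/stats?id=<query_group_id>
GET query_group/stats?name=<query_group_name>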

kaushalmahi12 (Contributor, Author) commented:

Thanks @jainankitk for the follow-up suggestions.

Might be good to add whether a QueryGroup is breaching its threshold or not. Not sure if we should add a per-resource breach indicator as well.

I think filtering based on a breach filter seems like a good idea and a good user experience. But the breach parameter is a little ambiguous: since we perform both rejections and cancellations, should we declare a breach at the cancellation threshold or the rejection threshold?

Do we need a cumulative cancellations metric in addition to the per-resource ones? I am assuming cumulative rejections is for tracking requests for which we did not even start execution.

IMO this would be redundant information, as the structure is simple enough and does not have too many nested subfields.

Maybe for simplicity, we need not support query_group/stats/completions,rejections. Although, we should take some filtering parameters, like breach=true, for looking at just the QueryGroups breaching one of the resource thresholds. We could extend that by taking breach=memory to get the QueryGroups breaching the memory threshold.

I like this idea and we should support it.
