[META] Improve Hybrid query latency #704

martin-gaievski · 2024-04-23T23:48:12Z

Hybrid query has high latency compared to other compound queries such as the Boolean query. Based on results collected for 2.13, and depending on the dataset and the exact query, it may be up to 10 times slower than Bool. Another reason for this issue is the degradation in hybrid query performance compared to the initial release, e.g. in OpenSearch 2.11.

The following are the goals for this work:

- bring performance of hybrid query to a level where it is comparable with Bool query:
  - For small datasets and sub-sets, it should match Bool with a deviation within 20% for p90
  - For large datasets (10M+ documents), when sub-queries return a large sub-set of documents (1M+ documents in a sub-query result), hybrid query should perform no worse than 2x of Bool query
  - Multiple sub-queries can add overhead of no more than 20% of overall query time for p90
- reach the level of performance of hybrid query released in 2.11
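
The latency goals above can be checked mechanically once benchmark samples exist. The following is a minimal sketch, not part of the plugin: the class `LatencyGoals` and its methods are hypothetical, it uses the nearest-rank definition of a percentile, and it assumes that "deviation within 20%" means the hybrid p90 is at most 1.2 times the Bool p90.

```java
import java.util.Arrays;

/** Hypothetical helper for checking the latency goals; not part of the plugin. */
final class LatencyGoals {

    /** Nearest-rank percentile of the samples, in the same unit as the samples. */
    static double percentile(double[] samples, double pct) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        // Rank is 1-based: the smallest value with at least pct% of samples at or below it.
        int rank = (int) Math.ceil(pct * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** True if the hybrid p90 is at most maxRatio times the Bool p90. */
    static boolean meetsGoal(double[] hybridMs, double[] boolMs, double maxRatio) {
        return percentile(hybridMs, 90) <= maxRatio * percentile(boolMs, 90);
    }
}
```

Under these assumptions, the small-dataset goal would be checked with `meetsGoal(hybrid, bool, 1.2)` and the large-dataset goal with `meetsGoal(hybrid, bool, 2.0)`.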

There were some GH issues in the past related to the same problem, e.g. #281. In addition, based on analysis of the source code and some profiling, I can think of the following items:

- don't execute the core TopDocsCollector, as it consumes compute and its results are ignored
- optimize plugin code for better performance: check for sub-optimal initializations, loops, type conversions etc.
- for cases when some sub-queries are rewritten to the same Lucene form, execute only one query and copy the scores
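
The last item, executing duplicate sub-queries only once, can be sketched generically. This is an illustration under stated assumptions rather than the plugin's implementation: `SubQueryDedup` and `executeDistinct` are hypothetical names, the query type `Q` is assumed to implement `equals`/`hashCode` so that identical rewritten forms compare equal, and `execute` is assumed to never return `null`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: run each distinct rewritten sub-query once and reuse its result. */
final class SubQueryDedup {

    /**
     * Returns one result per input query, in input order. Queries that are equal
     * after rewriting share the result of a single execution.
     */
    static <Q, R> List<R> executeDistinct(List<Q> rewrittenQueries, Function<Q, R> execute) {
        Map<Q, R> cache = new HashMap<>();
        List<R> results = new ArrayList<>(rewrittenQueries.size());
        for (Q query : rewrittenQueries) {
            R result = cache.get(query);
            if (result == null) {
                // First time this rewritten form is seen: execute and remember the result.
                result = execute.apply(query);
                cache.put(query, result);
            }
            results.add(result);
        }
        return results;
    }
}
```

Sharing the same result object is only safe if callers treat it as read-only; a real implementation would likely need to copy the scores, as the item above suggests.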

GitHub issues for each child item:

- [FEATURE] Implement parallel execution of sub-queries for hybrid search #279: parallel execution of sub-queries
- In hybrid query replace Java Stream calls with a faster alternative #705: replace streaming API calls
- [Feature Request] Provide capability for not adding top docs collector in the query search path OpenSearch#13170: allow empty query collector context (skip TopDocsCollector)
- In hybrid query allow to skip parallel score collection by core TopDocsCollector #729: enable empty query collector context for hybrid query scenario
- In hybrid query optimize the way we iterate over results and collect scores of sub queries #745: optimize the way we iterate over results and collect scores of sub queries
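
To illustrate the kind of change #705 describes, here is a minimal sketch of the same computation written with the Stream API and with a plain loop. The class `MaxScore` and the choice of "maximum score" as the computation are hypothetical and not taken from the plugin code; whether the loop is actually faster depends on the workload and should be confirmed by profiling.

```java
import java.util.List;

/** Hypothetical example: the same reduction with a stream and with a plain loop. */
final class MaxScore {

    /** Stream-based version: concise, but creates a pipeline on every call. */
    static float withStream(List<Float> scores) {
        return scores.stream().max(Float::compare).orElse(0f);
    }

    /** Loop-based version: same result for non-NaN scores, no pipeline setup. */
    static float withLoop(List<Float> scores) {
        if (scores.isEmpty()) {
            return 0f;
        }
        float max = scores.get(0);
        for (int i = 1; i < scores.size(); i++) {
            float score = scores.get(i);
            if (score > max) {
                max = score;
            }
        }
        return max;
    }
}
```

Both methods return `0f` for an empty list, so one can replace the other without changing behavior for callers that pass ordinary (non-NaN) scores.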

martin-gaievski added untriaged enhancement hybrid search hybrid query performance optimization and removed untriaged labels Apr 23, 2024

martin-gaievski changed the title ~~Improve Hybrid query latency~~ [META] Improve Hybrid query latency Apr 23, 2024

This was referenced Apr 26, 2024

Removed map of subquery to subquery index in favor of storing index as part of DISI wrapper to improve hybrid query latencies by 20% #711

Merged

Pass empty QueryCollectorContext in case of hybrid query to improve latencies by 20% #731

Merged

martin-gaievski mentioned this issue May 10, 2024

Use lazy initialization for priority queue of hits and scores to improve latencies by 20% #746

Merged

2 tasks

martin-gaievski mentioned this issue May 23, 2024

[BUG] Total hits count mismatch in Hybrid Query #756

Closed

martin-gaievski mentioned this issue Jun 10, 2024

[PROPOSAL] Advanced Optimization Techniques for Hybrid query #783

Open

martin-gaievski mentioned this issue Jun 21, 2024

[Blog] Boosting Hybrid Query Performance in OpenSearch 2.15 opensearch-project/project-website#3001

Closed

naveentatikonda closed this as completed Sep 18, 2024
