[BUG] Rescore queries do not seem to modify relevance scores when used in conjunction with a hybrid query #914

Closed
arasraj opened this issue Sep 23, 2024 · 4 comments

arasraj commented Sep 23, 2024

What is the bug?

When the query DSL contains a hybrid section as well as a rescore section, the rescore query does not seem to have any effect on the final document relevance scores.

How can one reproduce the bug?

Add phase result processors:

PUT /_search/pipeline/hybrid-search-pipeline
{
  "description": "Post processor for hybrid search",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": {
          "technique": "min_max"
        },
        "combination": {
          "technique": "arithmetic_mean",
          "parameters": {
            "weights": [
              0.4,
              0.6
            ]
          }
        }
      }
    }
  ]
}

Example query:

{
  "query": {
    "hybrid": {
      "queries": [
        {
          "bool": {
            "must": [
              {
                "match": {
                  "title": {
                    "query": "new"
                  }
                }
              }
            ]
          }
        },
        {
          "bool": {
            "must": [
              {
                "match": {
                  "title": {
                    "query": "york"
                  }
                }
              }
            ]
          }
        }
      ]
    }
  },
  "rescore": [
    {
      "query": {
        "score_mode": "multiply",
        "rescore_query": {
          "function_score": {
            "functions": [
              {
                "exp": {
                  "pub_time_utc": {
                    "origin": 1724976000,
                    "scale": "1d"
                  }
                }
              }
            ]
          }
        }
      }
    }
  ]
}

What is the expected behavior?

It is expected that the rescore query actually modifies the scores of the sub-query result docs prior to the coordinating node doing score normalization and merging. For the example query above, it is expected that the results from the sub-queries are reordered according to the decay function.
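As a rough illustration of the expected reordering, here is a minimal Python sketch. It assumes the documented `exp` decay formula (default `decay` of 0.5), assumes `pub_time_utc` and `origin` are epoch seconds so that `1d` is 86400, and assumes the default rescore weights of 1.0. The document scores and ages are made up for illustration.

```python
import math

SECONDS_PER_DAY = 86400  # assumption: pub_time_utc is stored as epoch seconds


def exp_decay(value: float, origin: float, scale: float,
              offset: float = 0.0, decay: float = 0.5) -> float:
    """Exponential decay: exp(lambda * max(0, |value - origin| - offset)),
    with lambda = ln(decay) / scale."""
    lam = math.log(decay) / scale
    return math.exp(lam * max(0.0, abs(value - origin) - offset))


def rescore_multiply(original_score: float, rescore_score: float,
                     query_weight: float = 1.0,
                     rescore_query_weight: float = 1.0) -> float:
    """score_mode "multiply": product of the two weighted scores."""
    return (query_weight * original_score) * (rescore_query_weight * rescore_score)


origin = 1724976000
# Hypothetical sub-query hits: (doc id, original score, pub_time_utc)
hits = [
    ("old_doc", 2.0, origin - 7 * SECONDS_PER_DAY),
    ("fresh_doc", 1.0, origin),
]
rescored = sorted(
    ((doc_id, rescore_multiply(score, exp_decay(ts, origin, SECONDS_PER_DAY)))
     for doc_id, score, ts in hits),
    key=lambda pair: pair[1],
    reverse=True,
)
```

With these made-up numbers, the fresher document would move ahead of the older one even though its original score is lower, which is the reordering the issue expects to see before normalization.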

What is your host/environment?

AWS-managed OpenSearch 2.15

arasraj added the bug and untriaged labels Sep 23, 2024
minalsha added the Features and enhancement labels and removed the bug, untriaged, and Features labels Sep 23, 2024
@martin-gaievski
Member

Hi @arasraj, thank you for reporting this issue.

We’ve tested the scenario you described and confirmed that the rescore functionality does not get executed for hybrid queries. Our team is currently investigating potential code changes to support rescore queries in this context.

In the meantime, I can suggest a workaround: run hybrid search queries separately and combine the scores outside of OpenSearch. You can include the rescore query as part of each individual query. For score normalization, you have two options:

1. Run all queries, retrieve the max and min scores for each query, then calculate a normalization factor (multiplier) to bring the scores into the [0, 1.0] range. Apply this multiplier at the client level.
2. Alternatively, run each query with size=0 to get the max score for each, and use it to compute a normalization factor. For example, if one query has a max score of 5.0 and another 0.8, you would use a multiplier of 6.25 (5.0 / 0.8) for the second query. You can then add this multiplier as a boost parameter in that query to bring the scores to the same scale.

For both methods, note that you’ll need to use a larger size than in a typical hybrid query. This is because, internally, hybrid queries retrieve up to size * number_of_shards documents and then apply normalization to return the top results. In these approaches, however, you’re only getting size documents. I recommend starting with size = 10x your usual value.
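A minimal client-side sketch of both options, in Python. The function names and the sample scores are hypothetical, and the handling of documents missing from one sub-query (counted as 0 here) is an assumption that may differ from what the normalization processor does internally. The weights mirror the `arithmetic_mean` weights from the pipeline in the issue.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Option 1: scale one sub-query's scores into [0, 1] using its own min and max."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal; treating each hit as a top hit is an assumption.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def combine(results: List[Dict[str, float]],
            weights: List[float]) -> List[Tuple[str, float]]:
    """Weighted arithmetic mean of the normalized scores, best first.
    A document missing from a sub-query contributes 0 for that sub-query."""
    normalized = [min_max_normalize(r) for r in results]
    doc_ids = set().union(*normalized)
    total = sum(weights)
    combined = {
        doc_id: sum(w * n.get(doc_id, 0.0) for w, n in zip(weights, normalized)) / total
        for doc_id in doc_ids
    }
    return sorted(combined.items(), key=lambda pair: pair[1], reverse=True)


def boost_multiplier(reference_max: float, other_max: float) -> float:
    """Option 2: boost that brings other_max up to the scale of reference_max."""
    return reference_max / other_max


# Hypothetical rescored hits (doc id -> score) from the two separate queries.
title_new = {"a": 4.0, "b": 2.0, "c": 0.0}
title_york = {"b": 0.8, "d": 0.4}
ranked = combine([title_new, title_york], weights=[0.4, 0.6])
```

In practice each input dictionary would come from a query run with the larger size recommended above, and only the top `size` entries of `ranked` would be kept.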

@minalsha
Collaborator

Thank you @martin-gaievski for sharing the details.

Hi @arasraj, thank you for reporting this issue. @martin-gaievski is actively investigating it.

@arasraj
Author

arasraj commented Sep 23, 2024

Great, thanks for looking into it.

@martin-gaievski I can test how these alternative methods will impact latency, but do you have any guesses? We currently have a ~50 ms latency budget and wouldn't want a 10x retrieval size or an extra query with size=0 to add too much latency on top of what we have.

@martin-gaievski
Member

martin-gaievski commented Sep 24, 2024

The exact latency will depend on the dataset and the queries. My guess is that it will go above 50 ms, mainly because of the additional latency of the multiple query layers.
One place where you can gain performance is parallel query execution. Currently, a hybrid query executes its sub-queries sequentially, but if you control the client side you can run them in parallel, so the overall latency becomes the maximum of the individual sub-query latencies.
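A minimal sketch of the client-side parallel execution, using Python's standard thread pool. `run_query` is a hypothetical callable that sends one request body and returns the response; with the opensearch-py client it could wrap something like `client.search(index=..., body=body)`, with an index name of your own.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List


def run_sub_queries_in_parallel(
    run_query: Callable[[Dict[str, Any]], Any],
    bodies: List[Dict[str, Any]],
) -> List[Any]:
    """Send every request body on its own thread and return the responses
    in the same order as the bodies. Wall-clock time is roughly that of the
    slowest sub-query rather than the sum of all of them."""
    if not bodies:
        return []
    with ThreadPoolExecutor(max_workers=len(bodies)) as pool:
        return list(pool.map(run_query, bodies))


def build_bodies(sub_queries: List[Dict[str, Any]],
                 rescore: List[Dict[str, Any]],
                 size: int) -> List[Dict[str, Any]]:
    """One standalone request per former hybrid sub-query, each carrying the rescore section."""
    return [{"size": size, "query": q, "rescore": rescore} for q in sub_queries]
```

The responses can then be normalized and combined on the client, as described in the workaround above.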
