[FEATURE] sltr queries with minimum_should_match features #20

Open
jhinch-at-atlassian-com opened this issue Nov 3, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

@jhinch-at-atlassian-com

Is your feature request related to a problem?

Non-linear scoring functions, particularly gradient-boosted decision trees, are a useful technique for combining scores from features that have different magnitudes and score distributions. However, an sltr query currently behaves like a bool query with a minimum_should_match of 0 plus a custom scoring function. This means it cannot conveniently be used within the initial query, and its use is currently encouraged only in rescore blocks.

For example given the following featureset definition:

{
  "featureset": {
    "features": [
      {
        "name": "title_text_match",
        "params": [
          "query_text"
        ],
        "template_language": "mustache",
        "template": {
          "match": {
            "title": "{{query_text}}"
          }
        }
      },
      {
        "name": "description_text_match",
        "params": [
          "query_text"
        ],
        "template_language": "mustache",
        "template": {
          "match": {
            "description": "{{query_text}}"
          }
        }
      },
      {
        "name": "description_knn_match",
        "params": [
          "query_embedding"
        ],
        "template_language": "mustache",
        "template": "{\"knn\":{\"description_vector\":{\"k\":10,\"vector\":{{#toJson}}query_embedding{{/toJson}}}}}"
      }
    ]
  }
}

and a model example_model which was created using the above featureset, the following sltr query:

{
  "sltr": {
    "model": "example_model",
    "params": {
      "query_text": "the text query",
      "query_embedding": [1.0, 0.4, ...]
    }
  }
}

can be thought of conceptually as:

{
  "bool": {
    "filter": {
      "match_all": {}
    },
    "should": [
      {
        "match": {
          "title": "the text query"
        }
      },
      {
        "match": {
          "description": "the text query"
        }
      },
      {
        "knn": {
          "description_vector": {
            "k": 10,
            "vector": [1.0, 0.4, ...]
          }
        }
      }
    ],
    "minimum_should_match": 0,
    // plus also use a special scoring function defined by example_model
  }
}

What solution would you like?

It would be great if sltr could require that a minimum number of the model's features match, for example:

{
  "sltr": {
    "model": "example_model",
    "params": {
      "query_text": "the text query",
      "query_embedding": [1.0, 0.4, ...]
    },
    "minimum_should_match": 1
  }
}

which would translate roughly to the following:

{
  "bool": {
    "should": [
      {
        "match": {
          "title": "the text query"
        }
      },
      {
        "match": {
          "description": "the text query"
        }
      },
      {
        "knn": {
          "description_vector": {
            "k": 10,
            "vector": [1.0, 0.4, ...]
          }
        }
      }
    ],
    "minimum_should_match": 1,
    // plus also use a special scoring function defined by example_model
  }
}

What alternatives have you considered?

It's possible to work around this by wrapping the sltr query in a surrounding bool query and duplicating the features as filters in that bool query:

{
  "bool": {
    "filter": [
      {
        "match": {
          "title": "the text query"
        }
      },
      {
        "match": {
          "description": "the text query"
        }
      },
      {
        "knn": {
          "description_vector": {
            "k": 10,
            "vector": [1.0, 0.4, ...]
          }
        }
      }
    ],
    "should": [
      {
        "sltr": {
          "model": "example_model",
          "params": {
            "query_text": "the text query",
            "query_embedding": [1.0, 0.4, ...]
          }
        }
      }
    ]
  }
}

However, this has two problems: the query blocks are executed twice, and the feature definitions must be duplicated and kept in sync between the featureset and the query.
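To reduce the risk of the two copies drifting apart, the wrapper query could be generated from a single list of feature queries. The following is only a rough Python sketch: build_workaround_query and its parameters are hypothetical names, not part of the plugin. It also assumes the intent is "at least N features match", so it nests the duplicated clauses in an inner bool with minimum_should_match instead of listing them directly as filters. The double execution remains.

```python
from typing import Any, Dict, List


def build_workaround_query(
    feature_queries: List[Dict[str, Any]],
    model: str,
    params: Dict[str, Any],
    minimum_should_match: int = 1,
) -> Dict[str, Any]:
    """Build the bool wrapper around an sltr query (hypothetical helper).

    feature_queries are the already-rendered feature queries, i.e. the
    featureset templates with their parameters substituted by the caller.
    """
    return {
        "bool": {
            # Non-scoring gate: a document must match at least
            # `minimum_should_match` of the duplicated feature queries.
            "filter": [
                {
                    "bool": {
                        "should": list(feature_queries),
                        "minimum_should_match": minimum_should_match,
                    }
                }
            ],
            # The sltr query contributes the model score for surviving docs.
            "should": [
                {"sltr": {"model": model, "params": dict(params)}}
            ],
        }
    }
```

For example, passing the two rendered match queries on title and description together with "example_model" and the same params yields a query shaped like the one above, with the feature clauses defined in one place in the calling code.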

Do you have any additional context?

This is the same feature request as o19s/elasticsearch-learning-to-rank#476, but for the OpenSearch fork.

@msfroh

msfroh commented Nov 8, 2023

We need to better understand how the sltr query is implemented. We have only just begun to explore the LTR plugin.

@jhinch-at-atlassian-com -- do you have any ideas of how sltr is implemented under the hood to help us get started?

@noCharger -- Can you look into this? Would be a good place to get started on understanding the plugin. Thanks!

@jhinch-at-atlassian-com

The best place to start looking is RankerQuery.RankerWeight#scorer and RankerQuery.DisjunctionDISI#advance. You would need to compare these with how the equivalent functionality works in the bool query. Most likely, the change would be to inspect the subIteratorsPriorityQueue when advance is called and count how many sub-iterators are positioned at the next doc ID, which would allow it to skip scoring documents that don't match enough features.
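As a rough illustration of that idea only, here is a minimal Python sketch. It is not the plugin's code and does not use Lucene's API: ListIterator, matching_docs, and the in-memory doc ID lists are hypothetical stand-ins for the per-feature sub-iterators and the priority queue. It only covers the case where minimum_should_match is at least 1.

```python
import heapq
from typing import Iterator, List, Sequence, Tuple

NO_MORE_DOCS = 2**31 - 1  # sentinel meaning "iterator exhausted" (assumption)


class ListIterator:
    """Hypothetical stand-in for one feature's doc ID iterator."""

    def __init__(self, doc_ids: Sequence[int]) -> None:
        self._docs = sorted(doc_ids)
        self._pos = -1
        self.doc_id = -1

    def next_doc(self) -> int:
        self._pos += 1
        if self._pos < len(self._docs):
            self.doc_id = self._docs[self._pos]
        else:
            self.doc_id = NO_MORE_DOCS
        return self.doc_id


def matching_docs(
    feature_postings: Sequence[Sequence[int]], minimum_should_match: int
) -> Iterator[Tuple[int, List[int]]]:
    """Yield (doc_id, matched feature indexes) for docs with enough matches."""
    if minimum_should_match < 1:
        raise ValueError("this sketch only handles minimum_should_match >= 1")
    # Priority queue of sub-iterators ordered by their current doc ID.
    heap = []
    for idx, postings in enumerate(feature_postings):
        it = ListIterator(postings)
        if it.next_doc() != NO_MORE_DOCS:
            heapq.heappush(heap, (it.doc_id, idx, it))
    while heap:
        doc = heap[0][0]
        # Collect every sub-iterator positioned on the smallest doc ID.
        matched = []
        while heap and heap[0][0] == doc:
            _, idx, it = heapq.heappop(heap)
            matched.append((idx, it))
        # Only docs matched by enough features would go on to be scored.
        if len(matched) >= minimum_should_match:
            yield doc, [idx for idx, _ in matched]
        # Advance the consumed sub-iterators and put them back in the queue.
        for idx, it in matched:
            if it.next_doc() != NO_MORE_DOCS:
                heapq.heappush(heap, (it.doc_id, idx, it))
```

With three features matching docs [1, 3, 5], [3, 4], and [3, 5, 9], a minimum_should_match of 2 would keep only docs 3 and 5; docs 1, 4, and 9 would be skipped before scoring.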

@msfroh msfroh removed the untriaged label Nov 10, 2023
@noCharger

@jhinch-at-atlassian-com I like this plan and the approach we're taking to support minimum_should_match. Would you like to contribute?

Projects
Status: Later (6 months plus)
Development

No branches or pull requests

3 participants