Add support for combined_fields (BM25F) #3996

Open
SViradiya-MarutiTech opened this issue Jul 25, 2022 · 23 comments
Labels
enhancement (Enhancement or improvement to existing feature or request), feature (New feature or request), help wanted (Extra attention is needed), Search (Search query, autocomplete ...etc)

Comments

@SViradiya-MarutiTech

Use Case

Currently, when I search a numeric field with free-text input using multi_match, I get a number_format_exception. From what we checked, a recent version of Elasticsearch (7.17) can run this search using combined_fields. Because AWS does not support combined_fields, we cannot use it, and because multi_match has a problem with numeric fields, we would not be able to upgrade our AWS OpenSearch. Let's take an example.

In the my_index documents, the year field is a number. When I search as shown below:

GET my_index/_search
{
  "query": {
    "multi_match": {
      "query": "2011 Toyota Corolla",
      "type": "cross_fields",
      "fields": [
        "year",
        "make",
        "model"
      ]
    }
  }
}

I got a 400 error:

{
    "error": {
        "root_cause": [
            {
                "type": "query_shard_exception",
                "reason": "failed to create query: For input string: \"2012 Toyota Corolla\"",
                "index_uuid": "dkHYbCuLSBKzHcwN3yQx6g",
                "index": "instant_offer"
            }
        ],
        "type": "search_phase_execution_exception",
        "reason": "all shards failed",
        "phase": "query",
        "grouped": true,
        "failed_shards": [
            {
                "shard": 0,
                "index": "instant_offer",
                "node": "-X63-ynlS92dLxcUHMoyjA",
                "reason": {
                    "type": "query_shard_exception",
                    "reason": "failed to create query: For input string: \"2012 Toyota Corolla\"",
                    "index_uuid": "dkHYbCuLSBKzHcwN3yQx6g",
                    "index": "instant_offer",
                    "caused_by": {
                        "type": "number_format_exception",
                        "reason": "For input string: \"2012 Toyota Corolla\""
                    }
                }
            }
        ]
    },
    "status": 400
}

When I removed the year field from the "fields" array, I got a 200 status code.

Feature Request

I would like to see combined_fields supported in the latest OpenSearch so that we can resolve our problem.

GET my_index/_search
{
  "query": {
    "combined_fields": {
      "query": "2011 Toyota Corolla",
      "fields": [
        "year",
        "make",
        "model"
      ]
    }
  }
}
@macrakis

macrakis commented Aug 16, 2022

Agreed, supporting BM25F aka BM 25F (called combined_fields in Elastic) is very useful. This functionality is based on Lucene's BM25FQuery.

@macohen
Contributor

macohen commented Oct 4, 2022

It looks like this is the query we would need to include: https://lucene.apache.org/core/9_4_0/sandbox/org/apache/lucene/sandbox/search/CombinedFieldQuery.html. For something in Lucene sandbox with potential changes in API, I think we would want to consider this part of an experimental release.

@msfroh
Collaborator

msfroh commented Oct 4, 2022

It looks like the server already has a dependency on lucene-sandbox: https://github.com/opensearch-project/OpenSearch/blob/main/server/build.gradle#L109

So, I think it makes sense to add this to the core as a new CombinedFieldQueryBuilder under https://github.com/opensearch-project/OpenSearch/tree/main/server/src/main/java/org/opensearch/index/query

@msfroh
Collaborator

msfroh commented Oct 4, 2022

I'm working on adding an API that supports the following:

{
  "query": {
    "combined_fields": {
      "query": "2011 Toyota Corolla",
      "fields": [
        "year^10.0", // Can specify boost
        "make",
        "model"
      ],
      "analyzer": "whitespace" // Optional
    }
  }
}

I'm thinking of applying the following logic:

  1. If an explicit analyzer is specified, we'll pass the query string to the specified analyzer to generate tokens from which we'll extract terms which we'll pass to the Lucene CombinedFieldQuery.Builder.
  2. If no explicit analyzer is specified, we'll iterate through the given fields and ask each analyzer to generate tokens, whose terms will be passed to the CombinedFieldQuery.Builder.

I don't think it makes sense to apply more complex query parsing logic (i.e. using Lucene's QueryBuilder) to the query string, since CombinedFieldQuery only supports terms.

@msfroh
Collaborator

msfroh commented Oct 5, 2022

Here is that logic captured in code:

    @Override
    protected Query doToQuery(QueryShardContext context) throws IOException {
        boolean hasMappedField = fieldBoosts.keySet().stream().anyMatch(k -> context.fieldMapper(k) != null);
        if (hasMappedField == false) {
            return Queries.newUnmappedFieldsQuery(fieldBoosts.keySet());
        }
        CombinedFieldQuery.Builder builder = new CombinedFieldQuery.Builder();
        for (Map.Entry<String, Float> fieldBoost : fieldBoosts.entrySet()) {
            builder.addField(fieldBoost.getKey(), fieldBoost.getValue());
        }
        Analyzer explicitAnalyzer = null;
        if (analyzer != null) {
            explicitAnalyzer = context.getMapperService().getIndexAnalyzers().get(analyzer);
            if (explicitAnalyzer == null) {
                throw new IllegalArgumentException("No analyzer found for [" + analyzer + "]");
            }
        }

        for (String fieldName : fieldBoosts.keySet()) {
            MappedFieldType fieldType = context.fieldMapper(fieldName);
            if (fieldType == null) {
                // ignore unmapped fields
                continue;
            }
            Analyzer fieldAnalyzer;
            if (explicitAnalyzer == null) {
                // Use per-field analyzer
                fieldAnalyzer = context.getSearchAnalyzer(fieldType);
            } else {
                fieldAnalyzer = explicitAnalyzer;
            }
            collectAllTerms(fieldName, fieldAnalyzer, value.toString(), builder);
        }
        return builder.build();
    }

    private static void collectAllTerms(String fieldName, Analyzer analyzer, String queryString,
                                 CombinedFieldQuery.Builder builder) throws IOException {
        // try-with-resources closes the stream even if analysis throws
        try (TokenStream tokenStream = analyzer.tokenStream(fieldName, queryString)) {
            TermToBytesRefAttribute termAtt = tokenStream.addAttribute(TermToBytesRefAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                // Copy the bytes: the attribute's BytesRef is reused between tokens
                builder.addTerm(BytesRef.deepCopyOf(termAtt.getBytesRef()));
            }
            // Complete the TokenStream contract before the stream is closed
            tokenStream.end();
        }
    }

I still need to write some unit tests before I have a PR ready.

@msfroh
Collaborator

msfroh commented Oct 5, 2022

Oh... I should also probably handle the case where no terms are produced, with an optional zero_terms_query parameter.

@Yury-Fridlyand

I think we should support the same parameters that Elasticsearch has: https://www.elastic.co/guide/en/elasticsearch/reference/8.4/query-dsl-combined-fields-query.html.

msfroh added a commit to msfroh/OpenSearch that referenced this issue Oct 6, 2022
This adds support for the CombinedFieldQuery from the Lucene sandbox.

The supported syntax is as follows:

```
{
  "combined_field": {
    "query" : "quick brown fox", // required
    "fields" : [ // if no fields specified then matches nothing
      "a_text_field", // must be text field, else will be ignored
      "a_text_field_with_weight^5"
    ],
    "analyzer" : "custom_analyzer", // optional
    "zero_terms_query" : "none" //optional
  }
}
```

If no analyzer is specified, terms are derived from the union of terms
from all fields' analyzers. The behavior of zero_terms_query is
like for multi_match.

Fixes:

- opensearch-project#3996

Signed-off-by: Michael Froh <[email protected]>
@msfroh
Collaborator

msfroh commented Nov 16, 2022

@SViradiya-MarutiTech -- I've been giving this more thought.

I don't think combined fields will help with your use-case. The error message you're getting there seems to be related to the multi_match trying to delegate query parsing to each of the underlying fields, including the numeric year field (which can't handle the string "2011 Toyota Corolla").

I think your best bet would be to add a separate year_text field of type text, and add a copy_to to copy the value from year to year_text. Then you could multi_match on year_text instead.
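
A minimal sketch of that suggestion, in the same Console style as the requests above. The thread does not show the actual mapping, so the `integer` type for year and the `text` types for make and model are assumptions, and existing documents would presumably need to be reindexed before year_text is populated:

```json
PUT my_index
{
  "mappings": {
    "properties": {
      "year":      { "type": "integer", "copy_to": "year_text" },
      "year_text": { "type": "text" },
      "make":      { "type": "text" },
      "model":     { "type": "text" }
    }
  }
}

GET my_index/_search
{
  "query": {
    "multi_match": {
      "query": "2011 Toyota Corolla",
      "type": "cross_fields",
      "fields": ["year_text", "make", "model"]
    }
  }
}
```

Because every queried field is now a text field, the query string is analyzed rather than parsed as a number, which is what triggered the number_format_exception.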

The matching behavior of combined fields and multi_match should be more or less the same; the difference comes from scoring. Where multi_match may give more weight to a value that is less common in a specific matching field, a combined fields query treats all the fields as one big bag of terms for frequency calculation purposes.

@msfroh
Collaborator

msfroh commented Dec 15, 2022

As another idea -- I'm also thinking that maybe it would make sense to implement this differently.

Instead of adding a new query type (as Elasticsearch did), I think it might make more sense to implement combined fields scoring within multi_match, behind an option.

This isn't a fully baked idea yet, but I'll think through the cases covered by multi_match and think about how they would behave if we imagine all terms come from one field.

@msfroh msfroh removed the v2.5.0 'Issues and PRs related to version v2.5.0' label Jan 5, 2023
@macrakis

macrakis commented Jan 5, 2023

Yes, the critical thing about BM25F is that the IDFs are global rather than local to the field.
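
To make the local-versus-global IDF point concrete, here is a small self-contained sketch. The toy corpus, the class name `IdfSketch`, and the helper names are all hypothetical; the IDF formula is the usual BM25 form, and a real implementation may approximate the combined document frequency rather than count it exactly as done here.

```java
import java.util.List;
import java.util.Map;

final class IdfSketch {
    // Toy corpus (hypothetical): each document maps a field name to its analyzed terms.
    static final List<Map<String, List<String>>> DOCS = List.of(
        Map.of("title", List.of("toyota"), "body", List.of("reliable", "car")),
        Map.of("title", List.of("honda"), "body", List.of("toyota", "rival")),
        Map.of("title", List.of("ford"), "body", List.of("toyota", "rival")),
        Map.of("title", List.of("mazda"), "body", List.of("toyota", "alternative"))
    );

    // BM25-style IDF: ln(1 + (N - n + 0.5) / (n + 0.5)).
    static double idf(long docFreq, long docCount) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Local statistic: number of documents whose given field contains the term.
    static long fieldDocFreq(List<Map<String, List<String>>> docs, String field, String term) {
        return docs.stream()
            .filter(d -> d.getOrDefault(field, List.of()).contains(term))
            .count();
    }

    // Global statistic: number of documents containing the term in any listed field.
    static long combinedDocFreq(List<Map<String, List<String>>> docs, List<String> fields, String term) {
        return docs.stream()
            .filter(d -> fields.stream().anyMatch(f -> d.getOrDefault(f, List.of()).contains(term)))
            .count();
    }

    public static void main(String[] args) {
        long n = DOCS.size();
        System.out.println("title IDF:    " + idf(fieldDocFreq(DOCS, "title", "toyota"), n));
        System.out.println("body IDF:     " + idf(fieldDocFreq(DOCS, "body", "toyota"), n));
        System.out.println("combined IDF: "
            + idf(combinedDocFreq(DOCS, List.of("title", "body"), "toyota"), n));
    }
}
```

In this corpus "toyota" appears in one title but in every document overall, so a per-field IDF treats a title hit as rare and valuable, while the combined statistic treats the term as common.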

@msfroh
Collaborator

msfroh commented Jan 10, 2023

This article by Nate Day has a pretty good explanation of the difference in behavior between combined_fields and multi_match when minimum_should_match requires that more than one term must match.

On multi_match, the matching terms requirement must be satisfied by one field, whereas combined_fields allows term matches across fields.
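
A rough sketch of that difference, reduced to set membership. It ignores analysis, boosts, and scoring, and the class and method names are hypothetical; it only shows how a minimum_should_match-style threshold is counted in each model.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

final class MatchingSketch {
    // Field-centric: some single field must contain at least minShouldMatch query terms.
    static boolean fieldCentric(Map<String, Set<String>> doc, List<String> terms, int minShouldMatch) {
        return doc.values().stream()
            .anyMatch(fieldTerms -> terms.stream().filter(fieldTerms::contains).count() >= minShouldMatch);
    }

    // Term-centric: a query term counts as matched if any field contains it.
    static boolean termCentric(Map<String, Set<String>> doc, List<String> terms, int minShouldMatch) {
        long matched = terms.stream()
            .filter(t -> doc.values().stream().anyMatch(fieldTerms -> fieldTerms.contains(t)))
            .count();
        return matched >= minShouldMatch;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> doc = Map.of(
            "make", Set.of("toyota"),
            "model", Set.of("corolla"));
        List<String> query = List.of("toyota", "corolla");
        System.out.println("field-centric: " + fieldCentric(doc, query, 2));
        System.out.println("term-centric:  " + termCentric(doc, query, 2));
    }
}
```

With both terms required, the document fails the field-centric check because no single field holds both terms, but passes the term-centric check because each term is found in some field.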

The conclusion of that article suggests that using a distinct query type is a good thing to make the changed behavior (from field-centric to term-centric matching) clearer. I'm inclined to agree. That's a pretty good counterargument to my half-baked idea above.

@macohen macohen added the help wanted and Search labels Mar 23, 2023
@odelmarcelle

This feature would be a great addition to OpenSearch. In our use case, documents are split into 'title' and 'body'. Because the body can vary greatly in length compared to the title, terms in the 'title' are drastically overweighted compared to those in the 'body' for long documents.

Using BM25F would lead to much more appropriate results.

@calebplum

calebplum commented Jun 13, 2023

Are there any updates on this? It would be hugely beneficial to have access to this feature

@zr-gwomark

Just want to piggyback on some of the previous comments. I'm currently working on a search migration project from a homebrewed search engine to OpenSearch, for a job search engine that looks at title, description, and company name fields. Our old search engine combines term-centric and field-centric approaches: it treats these fields as one field, but at index time we can assign a weight to each field to control how much it contributes to the term frequency for a particular term. With best_fields we can achieve a field-centric approach, but we can't treat IDF as the same for all three fields. With cross_fields we can treat all three fields as having the same IDF, but the description field is quite long compared to the other fields and ends up dominating the term frequency scores. combined_fields would allow us to migrate the behavior of our current engine to OpenSearch more directly. It would play a big role in our migration project.

Any updates on this feature or ETAs on when it might be done?

@macohen
Contributor

macohen commented Jun 16, 2023

I realized that having someone assigned here who is not actively working on this issue is confusing, so I unassigned @msfroh. We would be thrilled to have someone submit a PR and help work through this if anyone is up for it. Otherwise, we hope to get started on this sometime after October, but before Feb/March.

Check this board: https://github.com/orgs/opensearch-project/projects/45 for more context. Also, stay tuned for a public meeting where we can discuss these types of issues in the open. I'm hoping we can do this before the end of June and look forward to working through some of the work that is important to you.

@mingshl mingshl self-assigned this Oct 10, 2023
@floatms

floatms commented Oct 15, 2023

I would also like to voice my support for adding a combined_fields query type to OpenSearch.
I have given this some thought and I believe that combined_fields would make for a 'good default' in many cases.
My index contains documents with text fields like title, excerpt, keywords, authors, body etc., but I want my queries to match not in one but all of these fields with different boost values.
From other use cases I've seen and the replies in this thread, this seems common.
So I naturally gravitated towards multi_match, but learned of its pitfalls very quickly.
The Elasticsearch docs for multi_match in cross_fields mode, for example, explicitly state:

Note that cross_fields is usually only useful on short string fields that all have a boost of 1. Otherwise boosts, term freqs and length normalization contribute to the score in such a way that the blending of term statistics is not meaningful anymore.

and

The cross_fields type blends field statistics in a complex way that can be hard to interpret. The score combination can even be incorrect, in particular when some documents contain some of the search fields, but not all of them. You should consider the combined_fields query as an alternative, which is also term-centric but combines field statistics in a more robust way.

So while cross_fields is currently the closest we can get to proper cross-field queries, it is still a long way off from what is actually needed in many cases.
What I'm currently doing is extremely hacky. I basically combine all my fields into an additional mega-field and use that in a boolean 'must' clause to filter out the non-matching documents. Then I use the 'should' clauses to boost the different fields. This is not only a hack but also inconvenient because I have to re-index when I update any fields.
I also inflate the index size unnecessarily in this way.
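
The comment does not include the actual query, so the following is only a guess at the shape of the workaround it describes. The field names `all_text`, `title`, and `body`, the `and` operator, and the boost value are all hypothetical, and `all_text` is assumed to be populated at index time (for example with copy_to):

```json
GET my_index/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "all_text": { "query": "quick brown fox", "operator": "and" }
        }
      },
      "should": [
        { "match": { "title": { "query": "quick brown fox", "boost": 3 } } },
        { "match": { "body":  { "query": "quick brown fox" } } }
      ]
    }
  }
}
```

The must clause decides which documents match across the combined text, and the should clauses only adjust the score per field.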

Regarding the analyzer problem: It might be reasonable to require all queried fields to use the same analyzer and error out if they don't. I believe this means that you also just need to call CombinedFieldQuery.Builder.addTerm on each query term once. This seems more predictable.
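
The proposed validation could look roughly like the sketch below. It is deliberately detached from the OpenSearch classes: analyzers are represented by name only, and `AnalyzerCheckSketch` and `requireSharedAnalyzer` are hypothetical names, not part of any existing API.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class AnalyzerCheckSketch {
    // Returns the single analyzer name shared by all fields, or throws if they differ.
    // The map (field name -> analyzer name) stands in for a lookup against the index mapping.
    static String requireSharedAnalyzer(Map<String, String> fieldToAnalyzer) {
        Set<String> distinct = new TreeSet<>(fieldToAnalyzer.values());
        if (distinct.size() != 1) {
            throw new IllegalArgumentException(
                "All queried fields must use the same analyzer, but found " + distinct
                    + " for fields " + new TreeSet<>(fieldToAnalyzer.keySet()));
        }
        return distinct.iterator().next();
    }

    public static void main(String[] args) {
        System.out.println(requireSharedAnalyzer(Map.of("title", "standard", "body", "standard")));
        try {
            requireSharedAnalyzer(Map.of("title", "standard", "body", "whitespace"));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With a single shared analyzer, the query string would only need to be analyzed once, so each resulting term is added to the builder once.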

Unfortunately I don't have the capacity to submit a PR myself currently, but I hope this quick 'user report' can be of use.

@hdhalter

@macohen - can we please update the 'release_train' field if this is not going in 2.12? Thanks!

@macohen
Contributor

macohen commented Dec 18, 2023

Yes! Done. @mingshl if you do end up working on this, let's get it back on the release schedule...

@dblock dblock changed the title from "Add Support combined_fields in OpenSearch" to "Add support for combined_fields (BM25F)" Aug 30, 2024
@dblock
Member

dblock commented Aug 30, 2024

I renamed the title per a comment from @macrakis to make it clearer ;)

@GauravTech1986
Copy link

Hi, I wanted to check if we know when combined_fields will be available. We have a similar use case as mentioned above, and it looks like cross_fields may not be the best option. Thanks

@prudhvigodithi
Contributor

prudhvigodithi commented Sep 23, 2024

Going by the issue description, the main problem is that multi_match does not support numeric fields. Likewise, combined_fields, which uses CombinedFieldQuery, throws the error "java.lang.IllegalArgumentException: CombinedFieldQuery requires norms to be consistent across fields: some fields cannot have norms enabled, while others have norms disabled" when the year is indexed as a new IntPoint. Correct me if I'm wrong, but OpenSearch identifies the numeric term 2011 and matches it against the year field directly, so this won't be an issue here @msfroh @dblock.

I'm not sure, but this might work:

GET my_index/_search
{
  "query": {
    "combined_fields": {
      "query": "2011 Toyota Corolla",
      "fields": [
        "year.keyword",
        "make",
        "model"
      ]
    }
  }
}

when year is indexed as:

    "year": {
      "type": "text",              // Text field
      "fields": {
        "keyword": {
          "type": "keyword"        // Keyword subfield for exact matches
        }
      }
    },

Thank you
@getsaurabh02

@msfroh
Collaborator

msfroh commented Sep 23, 2024

Yeah... Overall, I think we should only allow combined_field to work across text fields that use the same analyzer. The particular model year example isn't great motivation for combined fields.

Where it's more likely to shine is when you have multiple text fields with the same analyzer, but very different distributions of terms. Classic BM25 will look at each field in isolation, and sum up the scores (or take the max score across matching fields). BM25F will essentially combine all the stats across all fields, mathematically treating it like one big field.
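
As a rough illustration of "one big field", the sketch below contrasts summing per-field BM25 scores with merging weighted term frequencies and lengths before scoring once. It is a simplified textbook form, not Lucene's implementation: the constants, the class name `Bm25fSketch`, and the choice to pass IDF values in as plain numbers are all assumptions made for the example.

```java
final class Bm25fSketch {
    static final double K1 = 1.2;
    static final double B = 0.75;

    // BM25 term-frequency saturation with document-length normalization.
    static double tfNorm(double tf, double len, double avgLen) {
        return tf / (tf + K1 * (1.0 - B + B * len / avgLen));
    }

    // Classic BM25 across fields: score each field in isolation, then sum.
    static double perFieldSum(double[] idf, double[] tf, double[] len, double[] avgLen) {
        double score = 0.0;
        for (int i = 0; i < tf.length; i++) {
            score += idf[i] * tfNorm(tf[i], len[i], avgLen[i]);
        }
        return score;
    }

    // BM25F-style: merge weighted frequencies and lengths first, then score once.
    static double combined(double idf, double[] weight, double[] tf, double[] len, double[] avgLen) {
        double cTf = 0.0, cLen = 0.0, cAvgLen = 0.0;
        for (int i = 0; i < tf.length; i++) {
            cTf += weight[i] * tf[i];
            cLen += weight[i] * len[i];
            cAvgLen += weight[i] * avgLen[i];
        }
        return idf * tfNorm(cTf, cLen, cAvgLen);
    }

    public static void main(String[] args) {
        // Hypothetical document: one hit in a 5-term title, one hit in a 500-term body.
        double[] tf = {1, 1};
        double[] len = {5, 500};
        double[] avgLen = {5, 300};
        System.out.println("per-field sum: " + perFieldSum(new double[] {1.2, 0.4}, tf, len, avgLen));
        System.out.println("combined:      " + combined(0.1, new double[] {2, 1}, tf, len, avgLen));
    }
}
```

With a single field of weight 1 the two functions agree; they diverge once several fields with different lengths and term distributions are involved, which is the situation described above.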

@prudhvigodithi
Contributor

True @msfroh. If it's multiple text fields, I don't see any problem with CombinedFieldQuery. Since the issue description shows year as numeric, even with combined_field in place for OpenSearch, this might not work straight away with numeric fields.

Projects
Status: Next (Next Quarter)
Development

No branches or pull requests