Add support for combined_fields (BM25F) #3996

SViradiya-MarutiTech · 2022-07-25T06:25:00Z

Use Case

Currently, when I search a number field with free-text input using multi_match, I get a number_format_exception. We checked the latest version of Elasticsearch (7.17), where this kind of search is possible using combined_fields. Because AWS does not support combined_fields, we cannot use it, and because multi_match has this problem with number fields, we would not be able to upgrade our AWS OpenSearch. Let's take an example.

In the my_index documents, the year field is a number. When I search like this:

GET my_index/_search
{
  "query": {
    "multi_match": {
      "query": "2011 Toyota Corolla",
      "type": "cross_fields",
      "fields": [
        "year",
        "make",
        "model"
      ]
    }
  }
}

I got 400 Error:

{
    "error": {
        "root_cause": [
            {
                "type": "query_shard_exception",
                "reason": "failed to create query: For input string: \"2012 Toyota Corolla\"",
                "index_uuid": "dkHYbCuLSBKzHcwN3yQx6g",
                "index": "instant_offer"
            }
        ],
        "type": "search_phase_execution_exception",
        "reason": "all shards failed",
        "phase": "query",
        "grouped": true,
        "failed_shards": [
            {
                "shard": 0,
                "index": "instant_offer",
                "node": "-X63-ynlS92dLxcUHMoyjA",
                "reason": {
                    "type": "query_shard_exception",
                    "reason": "failed to create query: For input string: \"2012 Toyota Corolla\"",
                    "index_uuid": "dkHYbCuLSBKzHcwN3yQx6g",
                    "index": "instant_offer",
                    "caused_by": {
                        "type": "number_format_exception",
                        "reason": "For input string: \"2012 Toyota Corolla\""
                    }
                }
            }
        ]
    },
    "status": 400
}

When I remove the year field from the "fields" array, I get a 200 status code.

Feature Request

I would like to see combined_fields support in the latest OpenSearch so that we can resolve our problem.

GET my_index/_search
{
  "query": {
    "combined_fields": {
      "query": "2011 Toyota Corolla",
      "fields": [
        "year",
        "make",
        "model"
      ]
    }
  }
}


macrakis · 2022-08-16T17:09:00Z

Agreed, supporting BM25F aka BM 25F (called combined_fields in Elastic) is very useful. This functionality is based on Lucene's BM25FQuery.

macohen · 2022-10-04T13:31:29Z

It looks like this is the query we would need to include: https://lucene.apache.org/core/9_4_0/sandbox/org/apache/lucene/sandbox/search/CombinedFieldQuery.html. For something in Lucene sandbox with potential changes in API, I think we would want to consider this part of an experimental release.

msfroh · 2022-10-04T18:24:56Z

It looks like the server already has a dependency on lucene-sandbox: https://github.com/opensearch-project/OpenSearch/blob/main/server/build.gradle#L109

So, I think it makes sense to add this to the core as a new CombinedFieldQueryBuilder under https://github.com/opensearch-project/OpenSearch/tree/main/server/src/main/java/org/opensearch/index/query

msfroh · 2022-10-04T23:48:44Z

I'm working on adding an API that supports the following:

{
  "query": {
    "combined_fields": {
      "query": "2011 Toyota Corolla",
      "fields": [
        "year^10.0", // Can specify boost
        "make",
        "model"
      ],
      "analyzer": "whitespace" // Optional
    }
  }
}
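To make the "field^boost" notation above concrete, here is a minimal sketch of how such entries could be split into a field-to-boost map. FieldBoostParser and its rules are hypothetical and are not OpenSearch's actual parsing code; in particular, rejecting boosts below 1 is an assumption made for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not OpenSearch code: turns "field^boost" specs into a field -> boost map. */
final class FieldBoostParser {
    static final float DEFAULT_BOOST = 1.0f;

    static Map<String, Float> parse(List<String> fieldSpecs) {
        // LinkedHashMap keeps the fields in the order they were listed in the request.
        Map<String, Float> boosts = new LinkedHashMap<>();
        for (String spec : fieldSpecs) {
            int caret = spec.lastIndexOf('^');
            String field = caret < 0 ? spec : spec.substring(0, caret);
            float boost = caret < 0 ? DEFAULT_BOOST : Float.parseFloat(spec.substring(caret + 1));
            if (field.isEmpty()) {
                throw new IllegalArgumentException("Missing field name in [" + spec + "]");
            }
            // Assumption for this sketch: boosts below 1 (and NaN) are rejected.
            if (!(boost >= 1.0f)) {
                throw new IllegalArgumentException("Boost must be >= 1 in [" + spec + "]");
            }
            boosts.put(field, boost);
        }
        return boosts;
    }
}
```

Calling FieldBoostParser.parse(List.of("year^10.0", "make", "model")) would give year a boost of 10.0 and leave make and model at the default of 1.0.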

I'm thinking of applying the following logic:

If an explicit analyzer is specified, we'll pass the query string to the specified analyzer to generate tokens from which we'll extract terms which we'll pass to the Lucene CombinedFieldQuery.Builder.
If no explicit analyzer is specified, we'll iterate through the given fields and ask each analyzer to generate tokens, whose terms will be passed to the CombinedFieldQuery.Builder.

I don't think it makes sense to apply more complex query parsing logic (i.e. using Lucene's QueryBuilder) to the query string, since CombinedFieldQuery only supports terms.

msfroh · 2022-10-05T00:16:24Z

Here is that logic captured in code:

    @Override
    protected Query doToQuery(QueryShardContext context) throws IOException {
        boolean hasMappedField = fieldBoosts.keySet().stream().anyMatch(k -> context.fieldMapper(k) != null);
        if (hasMappedField == false) {
            return Queries.newUnmappedFieldsQuery(fieldBoosts.keySet());
        }
        CombinedFieldQuery.Builder builder = new CombinedFieldQuery.Builder();
        for (Map.Entry<String, Float> fieldBoost : fieldBoosts.entrySet()) {
            builder.addField(fieldBoost.getKey(), fieldBoost.getValue());
        }
        Analyzer explicitAnalyzer = null;
        if (analyzer != null) {
            explicitAnalyzer = context.getMapperService().getIndexAnalyzers().get(analyzer);
            if (explicitAnalyzer == null) {
                throw new IllegalArgumentException("No analyzer found for [" + analyzer + "]");
            }
        }

        for (String fieldName : fieldBoosts.keySet()) {
            MappedFieldType fieldType = context.fieldMapper(fieldName);
            if (fieldType == null) {
                // ignore unmapped fields
                continue;
            }
            Analyzer fieldAnalyzer;
            if (explicitAnalyzer == null) {
                // Use per-field analyzer
                fieldAnalyzer = context.getSearchAnalyzer(fieldType);
            } else {
                fieldAnalyzer = explicitAnalyzer;
            }
            collectAllTerms(fieldName, fieldAnalyzer, value.toString(), builder);
        }
        return builder.build();
    }

    // Analyzes queryString with the given analyzer and adds every resulting term to the builder.
    private static void collectAllTerms(String fieldName, Analyzer analyzer, String queryString,
                                 CombinedFieldQuery.Builder builder) throws IOException {
        // try-with-resources closes the stream even if analysis throws
        try (TokenStream tokenStream = analyzer.tokenStream(fieldName, queryString)) {
            TermToBytesRefAttribute termAtt = tokenStream.addAttribute(TermToBytesRefAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                builder.addTerm(BytesRef.deepCopyOf(termAtt.getBytesRef()));
            }
            tokenStream.end(); // finish the TokenStream workflow before close()
        }
    }

I still need to write some unit tests before I have a PR ready.

msfroh · 2022-10-05T00:20:30Z

Oh... I should also probably handle the case where no terms are produced, with an optional zero_terms_query parameter.
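A rough sketch of what that zero-terms handling could look like, assuming the parameter accepts "none" and "all" as it does for multi_match. ZeroTermsSketch, its enums, and the plan method are hypothetical names used only for illustration; a real implementation would return Lucene queries rather than an enum.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch: decides what to run when analysis yields no terms. */
final class ZeroTermsSketch {
    enum ZeroTermsQuery {
        NONE, ALL;

        static ZeroTermsQuery fromString(String value) {
            // Accepts "none" / "all" in any letter case; anything else throws IllegalArgumentException.
            return ZeroTermsQuery.valueOf(value.toUpperCase(Locale.ROOT));
        }
    }

    enum Plan { MATCH_NONE, MATCH_ALL, COMBINED_FIELDS }

    static Plan plan(List<String> analyzedTerms, ZeroTermsQuery zeroTermsQuery) {
        if (!analyzedTerms.isEmpty()) {
            return Plan.COMBINED_FIELDS; // normal path: build the combined field query from the terms
        }
        // The analyzer removed everything (for example, a query made only of stop words).
        return zeroTermsQuery == ZeroTermsQuery.ALL ? Plan.MATCH_ALL : Plan.MATCH_NONE;
    }
}
```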

Yury-Fridlyand · 2022-10-05T02:10:49Z

I think we should support the same parameters as elastic has: https://www.elastic.co/guide/en/elasticsearch/reference/8.4/query-dsl-combined-fields-query.html.

This adds support for the CombinedFieldQuery from the Lucene sandbox. The supported syntax is as follows:

```
{
  "combined_field": {
    "query" : "quick brown fox", // required
    "fields" : [ // if no fields specified then matches nothing
      "a_text_field", // must be text field, else will be ignored
      "a_text_field_with_weight^5"
    ],
    "analyzer" : "custom_analyzer", // optional
    "zero_terms_query" : "none" //optional
  }
}
```

If no analyzer is specified, terms are derived from the union of terms from all fields' analyzers. The behavior of zero_terms_query is like for multi_match.

Fixes:
- opensearch-project#3996

Signed-off-by: Michael Froh <[email protected]>

msfroh · 2022-11-16T21:41:53Z

@SViradiya-MarutiTech -- I've been giving this more thought.

I don't think combined fields will help with your use-case. The error message you're getting there seems to be related to the multi_match trying to delegate query parsing to each of the underlying fields, including the numeric year field (which can't handle the string "2011 Toyota Corolla").
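As a rough illustration of that failure mode (plain Java, not the actual OpenSearch code path): handing the whole query string to a number parser fails with the same "For input string" message seen in the 400 response, while splitting it into tokens for a text field works. NumericFieldParseSketch and its methods are made-up names for this sketch.

```java
import java.util.Arrays;
import java.util.List;

/** Illustration only: contrasts numeric parsing with text tokenization of the same input. */
final class NumericFieldParseSketch {
    /** Mimics a numeric field being asked to interpret the entire query string. */
    static String parseAsYear(String queryText) {
        try {
            return "parsed " + Integer.parseInt(queryText);
        } catch (NumberFormatException e) {
            return "number_format_exception: " + e.getMessage();
        }
    }

    /** Mimics a text field with a whitespace-style analyzer. */
    static List<String> whitespaceTokens(String queryText) {
        return Arrays.asList(queryText.trim().split("\\s+"));
    }
}
```

Here parseAsYear("2011") succeeds, parseAsYear("2011 Toyota Corolla") reports a number format error, and whitespaceTokens("2011 Toyota Corolla") yields three separate tokens.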

I think your best bet would be to add a separate year_text field of type text, and add a copy_to to copy the value from year to year_text. Then you could multi_match on year_text instead.

Matching behavior of combined fields and multi_match should be more or less the same; the difference comes from scoring. Where multi_match may give more weight to a value that is less common in a specific matching field, a combined fields query treats all the fields as one big bag of terms for frequency calculation purposes.

msfroh · 2022-12-15T18:35:35Z

As another idea -- I'm also thinking that maybe it would make sense to implement this differently.

Instead of adding a new query type (as Elasticsearch did), I think it might make more sense to implement combined fields scoring within multi_match, behind an option.

This isn't a fully baked idea yet, but I'll think through the cases covered by multi_match and think about how they would behave if we imagine all terms come from one field.

macrakis · 2023-01-05T20:06:20Z

Yes, the critical thing about BM25F is that the IDFs are global rather than local to the field.
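A small sketch of why that matters, assuming the common BM25 IDF form ln(1 + (N - n + 0.5) / (n + 0.5)). IdfSketch is a hypothetical class, and taking the union of per-field postings as the "global" document frequency is a simplification for illustration, not necessarily how any engine computes it.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: per-field IDF versus one IDF computed across fields. */
final class IdfSketch {
    /** BM25-style IDF for a term that occurs in docFreq of docCount documents. */
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** Number of documents containing the term in at least one of the two fields. */
    static long unionDocFreq(Set<Integer> docsWithTermInFieldA, Set<Integer> docsWithTermInFieldB) {
        Set<Integer> union = new HashSet<>(docsWithTermInFieldA);
        union.addAll(docsWithTermInFieldB);
        return union.size();
    }
}
```

With 1000 documents, a term found in 5 titles but 400 bodies gets a much higher IDF in the title field than in the body field when each field is scored in isolation. A single cross-field document frequency removes that per-field skew.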

msfroh · 2023-01-10T23:23:50Z

This article by Nate Day has a pretty good explanation of the difference in behavior between combined_fields and multi_match when minimum_should_match requires that more than one term must match.

On multi_match, the matching terms requirement must be satisfied by one field, whereas combined_fields allows term matches across fields.
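The difference can be sketched with a toy model. MatchingSketch is hypothetical: it assumes distinct query terms, represents each field as a set of already-analyzed terms, and models only the field-centric behavior described above, ignoring everything else a real query does.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of field-centric versus term-centric minimum_should_match. */
final class MatchingSketch {
    /** Field-centric: some single field must contain at least minShouldMatch query terms. */
    static boolean fieldCentric(Map<String, Set<String>> docFields, List<String> queryTerms, int minShouldMatch) {
        for (Set<String> fieldTerms : docFields.values()) {
            long matched = queryTerms.stream().filter(fieldTerms::contains).count();
            if (matched >= minShouldMatch) {
                return true;
            }
        }
        return false;
    }

    /** Term-centric: a query term counts as matched if it appears in any field. */
    static boolean termCentric(Map<String, Set<String>> docFields, List<String> queryTerms, int minShouldMatch) {
        long matched = queryTerms.stream()
            .filter(term -> docFields.values().stream().anyMatch(fieldTerms -> fieldTerms.contains(term)))
            .count();
        return matched >= minShouldMatch;
    }
}
```

For a document with make = {toyota} and model = {corolla}, the query terms [toyota, corolla] with a minimum of 2 fail the field-centric check, because no single field holds both terms, but pass the term-centric one.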

The conclusion of that article suggests that using a distinct query type is a good thing to make the changed behavior (from field-centric to term-centric matching) clearer. I'm inclined to agree. That's a pretty good counterargument to my half-baked idea above.

odelmarcelle · 2023-04-22T17:15:05Z

This feature would be a great addition to OpenSearch. For our use case, we have documents split into 'title' and 'body'. As the body of the document can greatly vary in length compared to the title, it leads to a drastic overweight of terms in the 'title' compared to the ones in the 'body' for long documents.

Using BM25F would lead to much more appropriate results.

calebplum · 2023-06-13T08:34:23Z

Are there any updates on this? It would be hugely beneficial to have access to this feature

zr-gwomark · 2023-06-14T21:39:23Z

Just want to piggyback on some of the previous comments. I'm currently working on a search migration project from a homebrewed search engine to OpenSearch for a job search engine that looks at title, description, and company name fields. Our old search engine combines term-centric and field-centric approaches: it treats these fields as one field, but at index time we can assign a weight to each field to control how much it contributes to the term frequency for a particular term. With best_fields we can achieve a field-centric approach, but we can't treat IDF as the same for all three fields. With cross_fields we can treat all three fields as having the same IDF, but the description field is quite long compared to the other fields and ends up dominating the term frequency scores. combined_fields would allow us to migrate the behavior of our current engine to OpenSearch more directly. It would play a big role in our migration project.

Any updates on this feature or ETAs on when it might be done?

macohen · 2023-06-16T14:51:53Z

I realized that having someone assigned here who is not actively working on this issue is confusing, so I unassigned @msfroh. We would be thrilled to have someone submit a PR and help work through this if anyone is up for it. Otherwise, we hope to get started on this sometime after October, but before Feb/March.

Check this board: https://github.com/orgs/opensearch-project/projects/45 for more context. Also, stay tuned for a public meeting where we can discuss these types of issues in the open. I'm hoping we can do this before the end of June and look forward to working through some of the work that is important to you.

floatms · 2023-10-15T23:23:50Z

I would also like to voice my support for adding a combined_fields query type to OpenSearch.
I have given this some thought and I believe that combined_fields would make for a 'good default' in many cases.
My index contains documents with text fields like title, excerpt, keywords, authors, body etc., but I want my queries to match not in one but all of these fields with different boost values.
From other use cases I've seen and the replies in this thread, this seems common.
So I naturally gravitated towards multi_match, but learned of its pitfalls very quickly.
The Elasticsearch docs for multi_match in cross_fields mode, for example, explicitly state:

Note that cross_fields is usually only useful on short string fields that all have a boost of 1. Otherwise boosts, term freqs and length normalization contribute to the score in such a way that the blending of term statistics is not meaningful anymore.

and

The cross_fields type blends field statistics in a complex way that can be hard to interpret. The score combination can even be incorrect, in particular when some documents contain some of the search fields, but not all of them. You should consider the combined_fields query as an alternative, which is also term-centric but combines field statistics in a more robust way.

So while cross_fields is currently the closest we can get to proper cross-field queries, it is still a long way off from what is actually needed in many cases.
What I'm currently doing is extremely hacky. I basically combine all my fields into an additional mega-field and use that in a boolean 'must' clause to filter out the non-matching documents. Then I use the 'should' clauses to boost the different fields. This is not only a hack but also inconvenient because I have to re-index when I update any fields.
I also inflate the index size unnecessarily in this way.

Regarding the analyzer problem: It might be reasonable to require all queried fields to use the same analyzer and error out if they don't. I believe this means that you also just need to call CombinedFieldQuery.Builder.addTerm on each query term once. This seems more predictable.
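A minimal sketch of that check, assuming the caller has already resolved an analyzer name for each queried field. AnalyzerConsistencyCheck and requireSharedAnalyzer are hypothetical names, not existing OpenSearch APIs.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: rejects a query whose fields do not all share one analyzer. */
final class AnalyzerConsistencyCheck {
    /** Returns the single analyzer name shared by all fields, or throws if there is none. */
    static String requireSharedAnalyzer(Map<String, String> analyzerByField) {
        if (analyzerByField.isEmpty()) {
            throw new IllegalArgumentException("No fields given");
        }
        // TreeSet gives a stable, sorted listing for the error message.
        Set<String> distinct = new TreeSet<>(analyzerByField.values());
        if (distinct.size() != 1) {
            throw new IllegalArgumentException(
                "All queried fields must use the same analyzer, but found " + distinct);
        }
        return distinct.iterator().next();
    }
}
```

With a single shared analyzer, the query string only has to be analyzed once, and each resulting term is added to the builder exactly once.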

Unfortunately I don't have the capacity to submit a PR myself currently, but I hope this quick 'user report' can be of use.

hdhalter · 2023-12-18T20:54:21Z

@macohen - can we please update the 'release_train' field if this is not going in 2.12? Thanks!

macohen · 2023-12-18T20:56:49Z

Yes! Done. @mingshl if you do end up working on this, let's get it back on the release schedule...

dblock · 2024-08-30T16:51:57Z

I renamed the title per a comment from @macrakis to make it clearer ;)

GauravTech1986 · 2024-09-19T17:44:16Z

Hi, I wanted to check if we know when combined_fields will be available. We have a similar use case as mentioned above, and it looks like cross_fields may not be the best option. Thanks

prudhvigodithi · 2024-09-23T16:41:38Z

Coming from the issue description, the main problem is that multi_match does not support the numeric format. Likewise, combined_fields, which uses CombinedFieldQuery, throws the error java.lang.IllegalArgumentException: CombinedFieldQuery requires norms to be consistent across fields: some fields cannot have norms enabled, while others have norms disabled when the year is indexed as new IntPoint. Correct me if I'm wrong, but OpenSearch identifies the numeric term 2011 and matches it against the year field directly, so it won't be an issue here @msfroh @dblock.

I'm not sure, but this might work:

GET my_index/_search
{
  "query": {
    "combined_fields": {
      "query": "2011 Toyota Corolla",
      "fields": [
        "year.keyword",
        "make",
        "model"
      ]
    }
  }
}

when year is indexed as:

"year": {
  "type": "text",              // Text field
  "fields": {
    "keyword": {
      "type": "keyword"        // Keyword subfield for exact matches
    }
  }
},

Thank you
@getsaurabh02

msfroh · 2024-09-23T17:02:20Z

Yeah... Overall, I think we should only allow combined_field to work across text fields that use the same analyzer. The particular model year example isn't great motivation for combined fields.

Where it's more likely to shine is when you have multiple text fields with the same analyzer, but very different distributions of terms. Classic BM25 will look at each field in isolation, and sum up the scores (or take the max score across matching fields). BM25F will essentially combine all the stats across all fields, mathematically treating it like one big field.
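To make "one big field" concrete, here is a simplified sketch, not Lucene's actual implementation. It assumes the usual BM25 saturation with k1 = 1.2 and b = 0.75, and it assumes the combined term frequency and length are weighted sums over the fields. Bm25fSketch and its method names are hypothetical, and the IDF is passed in rather than computed.

```java
/** Simplified sketch of BM25F-style scoring for one term in one document. */
final class Bm25fSketch {
    static final double K1 = 1.2;
    static final double B = 0.75;

    /** Classic BM25 contribution for one term in one (real or virtual) field. */
    static double bm25(double tf, double length, double avgLength, double idf) {
        double norm = K1 * (1.0 - B + B * length / avgLength);
        return idf * tf / (tf + norm);
    }

    /**
     * BM25F-style score: merge per-field statistics into one virtual field, then score once.
     * All arrays are indexed by field and must have the same length.
     */
    static double combined(double[] weight, double[] tf, double[] length, double[] avgLength, double idf) {
        double combinedTf = 0.0;
        double combinedLength = 0.0;
        double combinedAvgLength = 0.0;
        for (int i = 0; i < weight.length; i++) {
            combinedTf += weight[i] * tf[i];
            combinedLength += weight[i] * length[i];
            combinedAvgLength += weight[i] * avgLength[i];
        }
        return bm25(combinedTf, combinedLength, combinedAvgLength, idf);
    }
}
```

Because saturation is applied once to the merged term frequency, a term repeated in both a short title and a long body is not rewarded twice, which is the contrast with summing per-field BM25 scores.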

prudhvigodithi · 2024-09-23T17:08:19Z

True @msfroh. If it's multiple text fields, I don't see any problem with CombinedFieldQuery. Since the issue description shows year as numeric, even once combined_field is in place for OpenSearch, this might not work straight away with numeric fields.

SViradiya-MarutiTech added enhancement Enhancement or improvement to existing feature or request untriaged labels Jul 25, 2022

SViradiya-MarutiTech mentioned this issue Jul 25, 2022

Support combined_fields function in SQL/PPL query engine opensearch-project/sql#189

Closed

5 tasks

andrross added feature New feature or request Indexing & Search and removed untriaged labels Jul 25, 2022

msfroh mentioned this issue Oct 6, 2022

Add support for combined_field query #4699

Closed

6 tasks

macohen assigned msfroh Oct 8, 2022

msfroh removed the v2.5.0 'Issues and PRs related to version v2.5.0' label Jan 5, 2023

Yury-Fridlyand mentioned this issue Jan 24, 2023

Reorganize development docs opensearch-project/sql#1200

Merged

6 tasks

macohen added help wanted Extra attention is needed Search Search query, autocomplete ...etc labels Mar 23, 2023

macohen mentioned this issue May 23, 2023

[RFC] Add Field Type Label #7693

Closed

macohen unassigned msfroh Jun 16, 2023

anasalkouz removed the Indexing & Search label Sep 19, 2023

mingshl self-assigned this Oct 10, 2023

macohen mentioned this issue Oct 30, 2023

[DOC] Combined Fields (BM25F) Documentation opensearch-project/documentation-website#5425

Closed

4 tasks

macohen unassigned mingshl Jan 22, 2024

msfroh mentioned this issue May 9, 2024

[BUG] function score query returned an invalid (negative) score with multi match cross fields query #7860

Closed

dblock changed the title ~~Add Support combined_fields in OpenSearch~~ Add support for combined_fields (BM25F) Aug 30, 2024

prudhvigodithi mentioned this issue Sep 23, 2024

Add CombinedFieldQueryExample msfroh/lucene-university#16

Open
