-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for combined_fields (BM25F) #3996
Comments
Agreed, supporting BM25F aka BM 25F (called combined_fields in Elastic) is very useful. This functionality is based on Lucene's BM25FQuery. |
It looks like this is the query we would need to include: https://lucene.apache.org/core/9_4_0/sandbox/org/apache/lucene/sandbox/search/CombinedFieldQuery.html. For something in Lucene sandbox with potential changes in API, I think we would want to consider this part of an experimental release. |
It looks like the server already has a dependency on lucene-sandbox: https://github.com/opensearch-project/OpenSearch/blob/main/server/build.gradle#L109 So, I think it makes sense to add this to the core as a new |
I'm working on adding an API that supports the following: {
"query": {
"combined_fields": {
"query": "2011 Toyota Corolla",
"fields": [
"year^10.0", // Can specify boost
"make",
"model"
],
"analyzer": "whitespace" // Optional
}
}
} I'm thinking of applying the following logic:
I don't think it makes sense to apply more complex query parsing logic (i.e. using Lucene's QueryBuilder) to the |
Here is that logic captured in code: @Override
protected Query doToQuery(QueryShardContext context) throws IOException {
boolean hasMappedField = fieldBoosts.keySet().stream().anyMatch(k -> context.fieldMapper(k) != null);
if (hasMappedField == false) {
return Queries.newUnmappedFieldsQuery(fieldBoosts.keySet());
}
CombinedFieldQuery.Builder builder = new CombinedFieldQuery.Builder();
for (Map.Entry<String, Float> fieldBoost : fieldBoosts.entrySet()) {
builder.addField(fieldBoost.getKey(), fieldBoost.getValue());
}
Analyzer explicitAnalyzer = null;
if (analyzer != null) {
explicitAnalyzer = context.getMapperService().getIndexAnalyzers().get(analyzer);
if (explicitAnalyzer == null) {
throw new IllegalArgumentException("No analyzer found for [" + analyzer + "]");
}
}
for (String fieldName : fieldBoosts.keySet()) {
MappedFieldType fieldType = context.fieldMapper(fieldName);
if (fieldType == null) {
// ignore unmapped fields
continue;
}
Analyzer fieldAnalyzer;
if (explicitAnalyzer == null) {
// Use per-field analyzer
fieldAnalyzer = context.getSearchAnalyzer(fieldType);
} else {
fieldAnalyzer = explicitAnalyzer;
}
collectAllTerms(fieldName, fieldAnalyzer, value.toString(), builder);
}
return builder.build();
}
private static void collectAllTerms(String fieldName, Analyzer analyzer, String queryString,
CombinedFieldQuery.Builder builder) throws IOException {
TokenStream tokenStream = analyzer.tokenStream(fieldName, queryString);
TermToBytesRefAttribute termAtt = tokenStream.addAttribute(TermToBytesRefAttribute.class);
tokenStream.reset();
while (tokenStream.incrementToken()) {
builder.addTerm(BytesRef.deepCopyOf(termAtt.getBytesRef()));
}
tokenStream.close();
} I still need to write some unit tests before I have a PR ready. |
Oh... I should also probably handle the case where no terms are produced, with an optional |
I think we should support the same parameters as elastic has: https://www.elastic.co/guide/en/elasticsearch/reference/8.4/query-dsl-combined-fields-query.html. |
This adds support for the CombinedFieldQuery from the Lucene sandbox. The supported syntax is as follows: ``` { "combined_field": { "query" : "quick brown fox", // required "fields" : [ // if no fields specified then matches nothing "a_text_field", // must be text field, else will be ignored "a_text_field_with_weight^5" ], "analyzer" : "custom_analyzer", // optional "zero_terms_query" : "none" //optional } } ``` If no analyzer is specified, terms are derived from the union of terms from all fields' analyzers. The behavior of zero_terms_query is like for multi_match. Fixes: - opensearch-project#3996 Signed-off-by: Michael Froh <[email protected]>
This adds support for the CombinedFieldQuery from the Lucene sandbox. The supported syntax is as follows: ``` { "combined_field": { "query" : "quick brown fox", // required "fields" : [ // if no fields specified then matches nothing "a_text_field", // must be text field, else will be ignored "a_text_field_with_weight^5" ], "analyzer" : "custom_analyzer", // optional "zero_terms_query" : "none" //optional } } ``` If no analyzer is specified, terms are derived from the union of terms from all fields' analyzers. The behavior of zero_terms_query is like for multi_match. Fixes: - opensearch-project#3996 Signed-off-by: Michael Froh <[email protected]>
This adds support for the CombinedFieldQuery from the Lucene sandbox. The supported syntax is as follows: ``` { "combined_field": { "query" : "quick brown fox", // required "fields" : [ // if no fields specified then matches nothing "a_text_field", // must be text field, else will be ignored "a_text_field_with_weight^5" ], "analyzer" : "custom_analyzer", // optional "zero_terms_query" : "none" //optional } } ``` If no analyzer is specified, terms are derived from the union of terms from all fields' analyzers. The behavior of zero_terms_query is like for multi_match. Fixes: - opensearch-project#3996 Signed-off-by: Michael Froh <[email protected]>
This adds support for the CombinedFieldQuery from the Lucene sandbox. The supported syntax is as follows: ``` { "combined_field": { "query" : "quick brown fox", // required "fields" : [ // if no fields specified then matches nothing "a_text_field", // must be text field, else will be ignored "a_text_field_with_weight^5" ], "analyzer" : "custom_analyzer", // optional "zero_terms_query" : "none" //optional } } ``` If no analyzer is specified, terms are derived from the union of terms from all fields' analyzers. The behavior of zero_terms_query is like for multi_match. Fixes: - opensearch-project#3996 Signed-off-by: Michael Froh <[email protected]>
This adds support for the CombinedFieldQuery from the Lucene sandbox. The supported syntax is as follows: ``` { "combined_field": { "query" : "quick brown fox", // required "fields" : [ // if no fields specified then matches nothing "a_text_field", // must be text field, else will be ignored "a_text_field_with_weight^5" ], "analyzer" : "custom_analyzer", // optional "zero_terms_query" : "none" //optional } } ``` If no analyzer is specified, terms are derived from the union of terms from all fields' analyzers. The behavior of zero_terms_query is like for multi_match. Fixes: - opensearch-project#3996 Signed-off-by: Michael Froh <[email protected]>
This adds support for the CombinedFieldQuery from the Lucene sandbox. The supported syntax is as follows: ``` { "combined_field": { "query" : "quick brown fox", // required "fields" : [ // if no fields specified then matches nothing "a_text_field", // must be text field, else will be ignored "a_text_field_with_weight^5" ], "analyzer" : "custom_analyzer", // optional "zero_terms_query" : "none" //optional } } ``` If no analyzer is specified, terms are derived from the union of terms from all fields' analyzers. The behavior of zero_terms_query is like for multi_match. Fixes: - opensearch-project#3996 Signed-off-by: Michael Froh <[email protected]>
This adds support for the CombinedFieldQuery from the Lucene sandbox. The supported syntax is as follows: ``` { "combined_field": { "query" : "quick brown fox", // required "fields" : [ // if no fields specified then matches nothing "a_text_field", // must be text field, else will be ignored "a_text_field_with_weight^5" ], "analyzer" : "custom_analyzer", // optional "zero_terms_query" : "none" //optional } } ``` If no analyzer is specified, terms are derived from the union of terms from all fields' analyzers. The behavior of zero_terms_query is like for multi_match. Fixes: - opensearch-project#3996 Signed-off-by: Michael Froh <[email protected]>
@SViradiya-MarutiTech -- I've been giving this more thought. I don't think combined fields will help with your use-case. The error message you're getting there seems to be related to the I think your best bet would be add a separate Matching behavior of combined field and |
As another idea -- I'm also thinking that maybe it would make sense to implement this differently. Instead of adding a new query type (as Elasticsearch did), I think it might make more sense to implement combined fields scoring within This isn't a full-baked idea yet, but I'll think through the cases covered by |
Yes, the critical thing about BM25F is that the IDFs are global rather than local to the field. |
This article by Nate Day has a pretty good explanation of the difference in behavior between On The conclusion of that article suggests that using a distinct query type is a good thing to make the changed behavior (from field-centric to term-centric matching) clearer. I'm inclined to agree. That's a pretty good counterargument to my half-baked idea above. |
This feature would be a great addition to OpenSearch. For our use case, we have documents splits into 'title' and 'body'. As the body of the document can greatly vary in length compared to the title, it leads to a drastic overweight of terms in the 'title' compared to the ones in the 'body' for long documents. Using the the BM25F would lead to much more appropriate results |
Are there any updates on this? It would be hugely beneficial to have access to this feature |
Just want to piggy back on some of the previous comments. I'm currently working on a search migration project from a hombrewed search engine to OpenSearch for a job search engine that looks at title, description, and company name fields. Our old search engine combines both term and field centric approaches by treating these fields as one field but at index time we can assign weights to each of these fields to control how much each field contributes to the term frequency for a particular term. With Any updates on this feature or ETAs on when it might be done? |
I realized that having someone assigned here who is not actively working on this issue is confusing so I unassigned @msfroh. We would be thrilled to have someone submit a PR and help work through on this if anyone is up for it. Otherwise, we hope to get started on this sometime after October, but before Feb/March. Check this board: https://github.com/orgs/opensearch-project/projects/45 for more context. Also, stay tuned for a public meeting where we can discuss this types of issues in the open. I'm hoping we can do this before the end of June and look forward to working through some of the work that is important to you. |
I would also like to voice my support for adding a combined_fields query type to OpenSearch.
and
So while cross_fields is currently the closest we can get to proper cross field queries it is still ways off from what is actually needed in many cases. Regarding the analyzer problem: It might be reasonable to require all queried fields to use the same analyzer and error out if they don't. I believe this means that you also just need to call CombinedFieldQuery.Builder.addTerm on each query term once. This seems more predictable. Unfortunately I don't have the capacity to submit a PR myself currently, but I hope this quick 'user report' can be of use. |
@macohen - can we please update the 'release_train' field if this is not going in 2. 12? Thanks! |
Yes! Done. @mingshl if you do end up working on this, let's get it back on the release schedule... |
I renamed the title per a comment from @macrakis to make it clearer ;) |
Hi, I wanted to check if we know when combined_fields will be available. We have a similar use case as mentioned above, and it looks like cross_fields may not be the best option. Thanks |
Coming from the issue description the main problem is with Not sure, this might work
when indexed
Thank you |
Yeah... Overall, I think we should only allow Where it's more likely to shine is when you have multiple text fields with the same analyzer, but very different distributions of terms. Classic BM25 will look at each field in isolation, and sum up the scores (or take the max score across matching fields). BM25F will essentially combine all the stats across all fields, mathematically treating it like one big field. |
True @msfroh, ya it its multiple text fields I dont see any problem with |
Use Case
Currently when i want to search number field with provided free search text using multi_match, I get
number_format_exception
. what we checked in latest version of Elastic Search(7.17) it is possible to search usingcombined_fields
. As AWS does not support combined_fields, we can not use combined_fields and as multi_match has problem with number field, we would not be able to upgrade our AWS OpenSearch, Lets take example.in my_index document, year field is number, When I wanted to search like below:
I got
400
Error:When I remove year field from array of
"fields"
, I got200
Status code.Feature Request
I would like to see
combined_fields
support in latest OpenSearch. So that we can resolve our problem.The text was updated successfully, but these errors were encountered: