-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG FIX] Fix bwc failure in neural sparse search #696
[BUG FIX] Fix bwc failure in neural sparse search #696
Conversation
…oken_score Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
https://github.com/opensearch-project/neural-search/actions/runs/8734284147/job/23965092639?pr=696
|
Just found another bwc flaky test ref. It looks just like the flaky test for neural sparse: we ingest doc in old cluster, but can not search it in the mixed cluster.
Then I do some experiments further. I create a cluster with one 2.14 node and another 3.0 node. Then I create a index with 2 shards(one shard for each node) and write 10 docs to it. When I call cat index to each node, we can see stream IO related errors in node log, and cat index api only return the doc number of local shard(4+6)
In those failed bwc tests, we only ingest one document. If this doc is ingested to other node, we'll get empty search response.(where we failed in bwc) |
How to confirm this is the same issue in this issue: #688, it looks they have totally different error logs? |
Is it possible to identify the incompatible object between 2.14 and 3.0? If this is a breaking change, we might can skip this test. |
Now we have 3 flaky test here: neural sparse search; hybrid search match query; and a test with error log "java.lang.Boolean.booleanValue()" because "isHidden" is null". This PR is fixing the first one. For the first flaky test, there are 2 possible test cases, one is org.opensearch.neuralsearch.bwc.NeuralQueryEnricherProcessorIT.testNeuralQueryEnricherProcessor_NeuralSparseSearch_E2EFlow; and another is org.opensearch.neuralsearch.bwc.NeuralSparseSearchIT.testSparseEncodingProcessor_E2EFlow. The reason behind is same: the neural sparse search request fail to broadcast between nodes and only return local result. |
I just learned that we need to keep bwc between latest 2.x and 3.0 . So we can not skip this test. The bwc issue in core should be considered a BUG. (Now I've found the match query and cat index api have bwc issue) |
I don't see this PR is fixing the issue you mentioned in the related issue part, let's merge this but do not close the issue: #688 |
LGTM |
* Adding integ tests for scenario of hybrid query with aggregations (opensearch-project#632) * Adding tests and params to ignore tests if needed Signed-off-by: Martin Gaievski <[email protected]> * [BUG FIX] Fix bwc failure in neural sparse search (opensearch-project#696) * fix comments Signed-off-by: zhichao-aws <[email protected]> --------- Signed-off-by: Martin Gaievski <[email protected]> Signed-off-by: zhichao-aws <[email protected]> Co-authored-by: Martin Gaievski <[email protected]>
Description
Recently we find some bwc failures of neural sparse search in rolling-upgrade case (from 2.x to 3.0 upgrade).
ref:
https://github.com/opensearch-project/neural-search/actions/runs/8649948821/job/23718643879?pr=683
https://github.com/opensearch-project/neural-search/actions/runs/8728772234/job/23957283115?pr=694
The reason is, we introduced max_token_score to 2.11 and we cut the PR directly to 2.x, the main branch is not involved. Although we deprecate this field after upgrade to lucene 9.8, this field is still a breaking change in main and 2.x. It will cause serialization/deserialization inconsistence between 3.0 nodes and 2.x nodes. This inconsistence will fail the search request on the shard, and then we get empty search response in bwc test above.
In this PR we add the max_token_score parsing logics to NeuralSparseQueryBuilder. We can parse this field but ignore it in the doToQuery. This logics is consistence with what we do in 2.x after we deprecate the field. After this get merged, max_token_score field parsing and UT logics in main is consistent with 2.x now
Issues Resolved
#688
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.