
[BUG] IndexOutOfBoundsException in Hybrid search for some queries only #497

Closed

tiagoshin opened this issue Nov 21, 2023 · 37 comments

Labels: bug (Something isn't working), v2.15.0

@tiagoshin

tiagoshin commented Nov 21, 2023

What is the bug?

I'm using hybrid search in OpenSearch version 2.11, and I'm getting the following error for some queries:

{
    "error": {
        "root_cause": [
            {
                "type": "index_out_of_bounds_exception",
                "reason": null
            }
        ],
        "type": "search_phase_execution_exception",
        "reason": "all shards failed",
        "phase": "query",
        "grouped": true,
        "failed_shards": [
            {
                "shard": 0,
                "index": "index-name",
                "node": "2aLNPmEjQ8OYCuCFyEyI-Q",
                "reason": {
                    "type": "index_out_of_bounds_exception",
                    "reason": null
                }
            }
        ],
        "caused_by": {
            "type": "index_out_of_bounds_exception",
            "reason": null,
            "caused_by": {
                "type": "index_out_of_bounds_exception",
                "reason": null
            }
        }
    },
    "status": 500
}

I get these logs:

2023-11-21 16:53:16 org.opensearch.action.search.SearchPhaseExecutionException: all shards failed
2023-11-21 16:53:16     at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:706) [opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.opensearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:379) [opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:745) [opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.opensearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:503) [opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.opensearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:301) [opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.opensearch.action.search.SearchExecutionStatsCollector.onFailure(SearchExecutionStatsCollector.java:104) [opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:75) [opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.opensearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:755) [opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.opensearch.transport.TransportService$6.handleException(TransportService.java:903) [opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1526) [opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.opensearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1640) [opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.opensearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1614) [opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.opensearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:80) [opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.opensearch.transport.TransportChannel.sendErrorResponse(TransportChannel.java:72) [opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.opensearch.action.support.ChannelActionListener.onFailure(ChannelActionListener.java:70) [opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:104) [opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) [opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.opensearch.threadpool.TaskAwareRunnable.doRun(TaskAwareRunnable.java:78) [opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.opensearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:59) [opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) [opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
2023-11-21 16:53:16     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
2023-11-21 16:53:16     at java.lang.Thread.run(Thread.java:833) [?:?]
2023-11-21 16:53:16 Caused by: org.opensearch.OpenSearchException$3
2023-11-21 16:53:16     at org.opensearch.OpenSearchException.guessRootCauses(OpenSearchException.java:708) ~[opensearch-core-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.opensearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:377) [opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     ... 23 more
2023-11-21 16:53:16 Caused by: java.lang.IndexOutOfBoundsException
2023-11-21 16:53:16     at java.nio.Buffer.checkIndex(Buffer.java:743) ~[?:?]
2023-11-21 16:53:16     at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:339) ~[?:?]
2023-11-21 16:53:16     at org.apache.lucene.store.ByteBufferGuard.getByte(ByteBufferGuard.java:115) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
2023-11-21 16:53:16     at org.apache.lucene.store.ByteBufferIndexInput$SingleBufferImpl.readByte(ByteBufferIndexInput.java:564) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
2023-11-21 16:53:16     at org.apache.lucene.codecs.lucene90.Lucene90NormsProducer$8.longValue(Lucene90NormsProducer.java:443) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
2023-11-21 16:53:16     at org.apache.lucene.search.LeafSimScorer.getNormValue(LeafSimScorer.java:47) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
2023-11-21 16:53:16     at org.apache.lucene.search.LeafSimScorer.score(LeafSimScorer.java:60) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
2023-11-21 16:53:16     at org.apache.lucene.search.TermScorer.score(TermScorer.java:75) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
2023-11-21 16:53:16     at org.apache.lucene.search.DisjunctionSumScorer.score(DisjunctionSumScorer.java:41) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
2023-11-21 16:53:16     at org.apache.lucene.search.DisjunctionScorer.score(DisjunctionScorer.java:193) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
2023-11-21 16:53:16     at org.apache.lucene.search.ConjunctionScorer.score(ConjunctionScorer.java:61) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
2023-11-21 16:53:16     at org.apache.lucene.search.DisjunctionMaxScorer.score(DisjunctionMaxScorer.java:65) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
2023-11-21 16:53:16     at org.apache.lucene.search.DisjunctionScorer.score(DisjunctionScorer.java:193) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
2023-11-21 16:53:16     at org.apache.lucene.search.ConjunctionScorer.score(ConjunctionScorer.java:61) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
2023-11-21 16:53:16     at org.apache.lucene.search.DisjunctionMaxScorer.score(DisjunctionMaxScorer.java:65) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
2023-11-21 16:53:16     at org.apache.lucene.search.DisjunctionScorer.score(DisjunctionScorer.java:193) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
2023-11-21 16:53:16     at org.apache.lucene.search.ReqOptSumScorer.score(ReqOptSumScorer.java:273) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
2023-11-21 16:53:16     at org.opensearch.neuralsearch.query.HybridQueryScorer.score(HybridQueryScorer.java:64) ~[?:?]
2023-11-21 16:53:16     at org.apache.lucene.search.ConjunctionScorer.score(ConjunctionScorer.java:61) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
2023-11-21 16:53:16     at org.apache.lucene.search.ScoreCachingWrappingScorer.score(ScoreCachingWrappingScorer.java:61) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
2023-11-21 16:53:16     at org.opensearch.common.lucene.MinimumScoreCollector.collect(MinimumScoreCollector.java:78) ~[opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.apache.lucene.search.Weight$DefaultBulkScorer.scoreRange(Weight.java:274) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
2023-11-21 16:53:16     at org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:254) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
2023-11-21 16:53:16     at org.opensearch.search.internal.CancellableBulkScorer.score(CancellableBulkScorer.java:71) ~[opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:38) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
2023-11-21 16:53:16     at org.opensearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:322) ~[opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.opensearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:281) ~[opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:551) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
2023-11-21 16:53:16     at org.opensearch.search.query.QueryPhase.searchWithCollector(QueryPhase.java:354) ~[opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.opensearch.search.query.QueryPhase$DefaultQueryPhaseSearcher.searchWithCollector(QueryPhase.java:441) ~[opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.opensearch.search.query.QueryPhase$DefaultQueryPhaseSearcher.searchWith(QueryPhase.java:425) ~[opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.opensearch.search.query.QueryPhaseSearcherWrapper.searchWith(QueryPhaseSearcherWrapper.java:65) ~[opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.opensearch.neuralsearch.search.query.HybridQueryPhaseSearcher.searchWith(HybridQueryPhaseSearcher.java:66) ~[?:?]
2023-11-21 16:53:16     at org.opensearch.search.query.QueryPhase.executeInternal(QueryPhase.java:280) ~[opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.opensearch.search.query.QueryPhase.execute(QueryPhase.java:153) ~[opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.opensearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:533) ~[opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.opensearch.search.SearchService.executeQueryPhase(SearchService.java:597) ~[opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.opensearch.search.SearchService$2.lambda$onResponse$0(SearchService.java:566) ~[opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.opensearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:74) ~[opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89) ~[opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.11.0.jar:2.11.0]
2023-11-21 16:53:16     ... 8 more

How can one reproduce the bug?

Honestly, it's very hard to reproduce the bug. As I'm using my company's data, I cannot share it publicly. However, we can work on this privately.

What is the expected behavior?

The expected behavior is that hybrid search does not return this error.

What is your host/environment?

macOS Ventura 13.3.1, running OpenSearch via Docker Compose.

Do you have any additional context?

When I search the exact same index with semantic search or lexical search alone, it works properly. The error only happens for hybrid search.
I observe a pattern: queries with more than one word tend to be more likely to hit this error than single-word queries. Failing queries look like "horror movies", "teen mom", "news radio".
However, I observed that when I changed the combination technique, some queries started working and other queries started failing.
I also observed that when I changed the index data, some queries started working and other queries started failing.
However, for the same data and same settings, the results are deterministic.

tiagoshin added the bug (Something isn't working) and untriaged labels on Nov 21, 2023
@navneet1v
Collaborator

@tiagoshin can you share the query you are using?

@navneet1v
Collaborator

I can see the exception is coming from this: https://github.com/opensearch-project/neural-search/blob/main/src/main/java/org/opensearch/neuralsearch/query/HybridQueryScorer.java#L64

@tiagoshin please share your query skeleton so that we can better debug the issue here.

@tiagoshin
Author

Thank you @navneet1v. I shared the query skeleton with David Fowler from AWS customer support; did you receive it?

@navneet1v
Collaborator

@tiagoshin Looking at the logs that were shared, I can see that HybridQueryPhaseSearcher, which is responsible for running the query, is not invoked. This leads me to believe that either the hybrid query clause was not the top-level clause, or there are nested fields in the index that cause the hybrid query clause to be wrapped in other query clauses (this is OpenSearch's default behavior).

We are already working on a fix for nested query clauses as part of this GitHub issue: #466.

@tiagoshin
Author

Hi @navneet1v, I see HybridQueryPhaseSearcher invoked in the following line, don't I?

2023-11-21 16:53:16 at org.opensearch.neuralsearch.search.query.HybridQueryPhaseSearcher.searchWith(HybridQueryPhaseSearcher.java:66) ~[?:?]

@navneet1v
Collaborator

@tiagoshin if you look at the code: https://github.com/opensearch-project/neural-search/blob/2.11/src/main/java/org/opensearch/neuralsearch/search/query/HybridQueryPhaseSearcher.java#L66

Line 66 is hit when the top-level query is not a hybrid query.
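
Roughly, the dispatch looks like this (a toy model with stand-in types made up for illustration, not the plugin's actual classes):

// Toy model of the branch around the linked line 66: the hybrid collector
// path is taken only when the top-level query object is a HybridQuery;
// anything else, including a hybrid clause wrapped inside another query,
// falls through to the default query phase.
public class DispatchSketch {
    interface Query {}
    static class HybridQuery implements Query {}
    static class BooleanQuery implements Query {} // e.g. the nested-field wrapper

    static String searchWith(Query query) {
        if (query instanceof HybridQuery) {
            return "hybrid collector path";
        }
        return "default query phase (the linked line 66)";
    }

    public static void main(String[] args) {
        System.out.println(searchWith(new HybridQuery()));  // hybrid collector path
        System.out.println(searchWith(new BooleanQuery())); // default query phase
    }
}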

@tiagoshin
Author

That makes sense, thank you @navneet1v!

@martin-gaievski
Member

We have pushed a code change that should fix this issue; please check the details in this issue comment: #466 (comment)

vamshin added the v2.12.0 (Issues targeting release v2.12.0) label on Dec 14, 2023
@Lemmmy

Lemmmy commented Dec 17, 2023

I'm getting a similar but different exception on OS 2.11.1 (6b1986e964d440be9137eba1413015c31c5a7752):

Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 2 out of bounds for length 2
        at org.apache.lucene.search.DisiPriorityQueue.add(DisiPriorityQueue.java:100) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
        at org.opensearch.neuralsearch.query.HybridQueryScorer.initializeSubScorersPQ(HybridQueryScorer.java:146) ~[?:?]
        at org.opensearch.neuralsearch.query.HybridQueryScorer.<init>(HybridQueryScorer.java:47) ~[?:?]
        at org.opensearch.neuralsearch.query.HybridQueryWeight.scorer(HybridQueryWeight.java:91) ~[?:?]
        at org.apache.lucene.search.Weight.bulkScorer(Weight.java:166) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
        at org.opensearch.search.internal.ContextIndexSearcher$1.bulkScorer(ContextIndexSearcher.java:374) ~[opensearch-2.11.1.jar:2.11.1]
        at org.opensearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:319) ~[opensearch-2.11.1.jar:2.11.1]
        at org.opensearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:281) ~[opensearch-2.11.1.jar:2.11.1]
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:551) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
        at org.opensearch.neuralsearch.search.query.HybridQueryPhaseSearcher.searchWithCollector(HybridQueryPhaseSearcher.java:104) ~[?:?]
        at org.opensearch.neuralsearch.search.query.HybridQueryPhaseSearcher.searchWith(HybridQueryPhaseSearcher.java:64) ~[?:?]
        at org.opensearch.search.query.QueryPhase.executeInternal(QueryPhase.java:280) ~[opensearch-2.11.1.jar:2.11.1]
        at org.opensearch.search.query.QueryPhase.execute(QueryPhase.java:153) ~[opensearch-2.11.1.jar:2.11.1]
        at org.opensearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:533) ~[opensearch-2.11.1.jar:2.11.1]
        at org.opensearch.search.SearchService.executeQueryPhase(SearchService.java:597) ~[opensearch-2.11.1.jar:2.11.1]
        at org.opensearch.search.SearchService$2.lambda$onResponse$0(SearchService.java:566) ~[opensearch-2.11.1.jar:2.11.1]
        at org.opensearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:74) ~[opensearch-2.11.1.jar:2.11.1]
        at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89) ~[opensearch-2.11.1.jar:2.11.1]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.11.1.jar:2.11.1]
        ... 8 more

Full exception: aioobe.txt

Unfortunately I'm not familiar enough with the subject matter to know if this is the same exception or if it has been patched. I get this error more reproducibly on my single-node cluster with only 8800 documents and the following search pipeline and query:

{
  "description": "Post processor for hybrid search",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": {
          "technique": "min_max"
        },
        "combination": {
          "technique": "arithmetic_mean",
          "parameters": {
            "weights": [
              0.6,
              0.3,
              0.1
            ]
          }
        }
      }
    }
  ]
}

Query:

{
  "sort": [
    {
      "_score": {
        "order": "desc"
      }
    }
  ],
  "_source": {
    "exclude": [
      "text_embedding"
    ]
  },
  "query": {
    "hybrid": {
      "queries": [
        {
          "match_phrase": {
            "text": {
              "query": "foo"
            }
          }
        },
        {
          "match": {
            "text": {
              "query": "foo"
            }
          }
        },
        {
          "neural": {
            "text_embedding": {
              "query_text": "foo",
              "model_id": "--------",
              "k": 5
            }
          }
        }
      ]
    }
  }
}

I have narrowed the issue down to cases where one or more of the sub-queries return effectively 0 results after normalization. That is, the scores are so low after normalization that they are completely discarded. If I remove two of the sub-queries and disable the search pipeline, the query works. Or if I make a more specific query where the sub-queries return a similar number of results, the query also works.
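
For illustration, here is a toy sketch of the min_max formula as I understand it (my own code, not the plugin's; the claim that zero-scored hits are discarded is just my reading of the observed behavior):

// Toy min-max normalization: norm = (s - min) / (max - min).
// The lowest-scoring hit of each sub-query always maps to exactly 0.
public class MinMaxSketch {
    static float[] minMaxNormalize(float[] scores) {
        float min = Float.POSITIVE_INFINITY, max = Float.NEGATIVE_INFINITY;
        for (float s : scores) {
            min = Math.min(min, s);
            max = Math.max(max, s);
        }
        float[] out = new float[scores.length];
        for (int i = 0; i < scores.length; i++) {
            // degenerate case: all scores equal, map everything to 1.0
            out[i] = (max == min) ? 1.0f : (scores[i] - min) / (max - min);
        }
        return out;
    }

    public static void main(String[] args) {
        // a sub-query whose raw scores sit in a narrow, low band
        float[] normalized = minMaxNormalize(new float[] {0.011f, 0.010f, 0.010f});
        for (float f : normalized) {
            System.out.println(f); // prints 1.0, then 0.0, 0.0
        }
    }
}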

I'm happy to provide more information if needed, or make a new issue if it's not the same one as this/#466. I'm running in Docker, so not quite sure how to test the RC build from that thread.

Edit: also tried on 2.12.0, still happening. Is this new issue material?

@navneet1v
Collaborator

Edit: also tried on 2.12.0, still happening. Is this new issue material?

@Lemmmy so you are saying that you tried the tar provided in this comment: #466 (comment) and it is still not working?

cc: @martin-gaievski

@navneet1v
Collaborator

I'm running in Docker, so not quite sure how to test the RC build from that thread.

@Lemmmy OpenSearch's CI publishes builds every day to the OpenSearch staging repo on Docker Hub: https://hub.docker.com/r/opensearchstaging/opensearch/tags

You can run docker pull opensearchstaging/opensearch:2.12.0 to pull the 2.12.0 version of OpenSearch and check whether the issue still exists.

@navneet1v
Collaborator

navneet1v commented Dec 19, 2023

@Lemmmy I did some more deep-diving and I am able to reproduce the issue. I also tested with different queries where one query clause doesn't yield any result; that use case works perfectly.

But I was able to figure out the root cause of the exception you are getting. Here are the steps to reproduce:

Setup

PUT example-index
{
  "settings": {
    "index": {
      "knn": true,
      "number_of_shards": 1,
      "number_of_replicas": 0
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text"
      },
      "my_vector": {
        "type": "knn_vector",
        "dimension": 1,
        "method": {
          "name": "hnsw",
          "space_type": "l2",
          "engine": "lucene"
        }
      },
      "integer": {
        "type": "integer"
      }
    }
  }
}


PUT example-index/_bulk?refresh
{"index":{"_id":"1"}}
{"text": "neural","my_vector": [5], "integer": 1 }
{"index":{"_id":"2"}}
{"text": "neural neural","my_vector": [4], "integer": 2 }
{"index":{"_id":"3"}}
{"text": "neural neural neural","my_vector": [3], "integer": 3 }
{"index":{"_id":"4"}}
{"text": "neural neural neural neural", "integer": 4 }
{"index":{"_id":"5"}}
{"my_vector": [0], "integer": 5 }


PUT /_search/pipeline/nlp-search-pipeline
{
  "description": "Post processor for hybrid search",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": {
          "technique": "min_max"
        }
      }
    }
  ]
}


# Search Query
POST example-index/_search?search_pipeline=nlp-search-pipeline
{
  "query": {
    "hybrid": {
      "queries": [
        {
          "term": {
            "text": "neural"
          }
        },
        {
          "term": {
            "text": "neural"
          }
        },
        {
          "knn": {
            "my_vector": {
              "vector": [
                3
              ],
              "k": 3
            }
          }
        }
      ]
    }
  },
  "size": 3
}

Output of Search

{
	"error": {
		"root_cause": [
			{
				"type": "array_index_out_of_bounds_exception",
				"reason": "Index 2 out of bounds for length 2"
			}
		],
		"type": "search_phase_execution_exception",
		"reason": "all shards failed",
		"phase": "query",
		"grouped": true,
		"failed_shards": [
			{
				"shard": 0,
				"index": "example-index",
				"node": "roL2TjVsTdex976hXKl9jg",
				"reason": {
					"type": "array_index_out_of_bounds_exception",
					"reason": "Index 2 out of bounds for length 2"
				}
			}
		],
		"caused_by": {
			"type": "array_index_out_of_bounds_exception",
			"reason": "Index 2 out of bounds for length 2",
			"caused_by": {
				"type": "array_index_out_of_bounds_exception",
				"reason": "Index 2 out of bounds for length 2"
			}
		}
	},
	"status": 500
}

Stacktrace

Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 2 out of bounds for length 2
	at org.apache.lucene.search.DisiPriorityQueue.add(DisiPriorityQueue.java:100) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.opensearch.neuralsearch.query.HybridQueryScorer.initializeSubScorersPQ(HybridQueryScorer.java:146) ~[?:?]
	at org.opensearch.neuralsearch.query.HybridQueryScorer.<init>(HybridQueryScorer.java:47) ~[?:?]
	at org.opensearch.neuralsearch.query.HybridQueryWeight.scorer(HybridQueryWeight.java:91) ~[?:?]
	at org.apache.lucene.search.Weight.bulkScorer(Weight.java:166) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.opensearch.search.internal.ContextIndexSearcher$1.bulkScorer(ContextIndexSearcher.java:374) ~[opensearch-2.11.1-SNAPSHOT.jar:2.11.1-SNAPSHOT]
	at org.opensearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:319) ~[opensearch-2.11.1-SNAPSHOT.jar:2.11.1-SNAPSHOT]
	at org.opensearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:281) ~[opensearch-2.11.1-SNAPSHOT.jar:2.11.1-SNAPSHOT]
	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:551) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.opensearch.neuralsearch.search.query.HybridQueryPhaseSearcher.searchWithCollector(HybridQueryPhaseSearcher.java:104) ~[?:?]
	at org.opensearch.neuralsearch.search.query.HybridQueryPhaseSearcher.searchWith(HybridQueryPhaseSearcher.java:64) ~[?:?]
	at org.opensearch.search.query.QueryPhase.executeInternal(QueryPhase.java:280) ~[opensearch-2.11.1-SNAPSHOT.jar:2.11.1-SNAPSHOT]
	at org.opensearch.search.query.QueryPhase.execute(QueryPhase.java:153) ~[opensearch-2.11.1-SNAPSHOT.jar:2.11.1-SNAPSHOT]
	at org.opensearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:533) ~[opensearch-2.11.1-SNAPSHOT.jar:2.11.1-SNAPSHOT]
	at org.opensearch.search.SearchService.executeQueryPhase(SearchService.java:597) ~[opensearch-2.11.1-SNAPSHOT.jar:2.11.1-SNAPSHOT]
	at org.opensearch.search.SearchService$2.lambda$onResponse$0(SearchService.java:566) ~[opensearch-2.11.1-SNAPSHOT.jar:2.11.1-SNAPSHOT]
	at org.opensearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:74) ~[opensearch-2.11.1-SNAPSHOT.jar:2.11.1-SNAPSHOT]
	at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89) ~[opensearch-2.11.1-SNAPSHOT.jar:2.11.1-SNAPSHOT]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.11.1-SNAPSHOT.jar:2.11.1-SNAPSHOT]
	... 8 more

Root Cause

So, what is happening here: if we look at the queries provided in the hybrid clause, I deliberately made my 2 text search queries exactly the same.

        {
          "term": {
            "text": "neural"
          }
        }

We create a map from Query to index (the key being the query object) here and use that map here to create the PQ and to assign the scorers created for each query. Because both text queries are the same, the map is created with size 2 instead of size 3 (as we have 3 queries), which leads to the exception.

Now, in production I don't expect users to provide two exactly identical queries, but this is a bug nonetheless.
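
Here is a minimal standalone sketch of the collapse (toy code, not the plugin classes):

import java.util.HashMap;
import java.util.Map;

// Toy reproduction of the sizing bug: three sub-queries, two of them equal,
// collapse into two map entries; an array sized from the map then overflows
// when one scorer per sub-query is added (same shape as DisiPriorityQueue.add).
public class DuplicateQuerySizing {
    public static void main(String[] args) {
        String[] subQueries = {"term:text=neural", "term:text=neural", "knn:my_vector"};

        Map<String, Integer> queryToIndex = new HashMap<>();
        for (int i = 0; i < subQueries.length; i++) {
            queryToIndex.put(subQueries[i], i); // equal keys overwrite each other
        }

        Object[] heap = new Object[queryToIndex.size()]; // size 2, not 3
        int size = 0;
        for (String q : subQueries) {
            heap[size++] = q; // third add throws ArrayIndexOutOfBoundsException: Index 2
        }
    }
}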

Please let me know if removing the duplicate queries solves your issue.

Proposed Solution

We should throw an exception with a proper message telling the user that the defined queries contain duplicates. @Lemmmy please let me know your thoughts on this.

cc: @martin-gaievski

@navneet1v
Collaborator

@tiagoshin I did some deep-diving here: #497 (comment). Can you check on your side whether this was the issue for you as well? If not, can you provide the query skeleton so that I can make sure all bugs reported in this issue are resolved? I understand that your query contained nested fields, which we have already fixed for 2.12. But if there is any other issue you are facing, please do comment so that it can be fixed in 2.12.

@Lemmmy

Lemmmy commented Dec 19, 2023

Thanks for the quick investigation! To clarify, am I supposed to avoid combining match and neural with the same query? As in, this isn't okay? (from the docs):

"query": {
    "hybrid": {
      "queries": [
        {
          "match": {
            "text": {
              "query": "Hi world"
            }
          }
        },
        {
          "neural": {
            "passage_embedding": {
              "query_text": "Hi world",
              "model_id": "aVeif4oB5Vm0Tdw8zYO2",
              "k": 5
            }
          }
        }
      ]
    }
  }

Or is it just because of my use of both match_phrase and match?

When changing this line:

DisiPriorityQueue subScorersPQ = new DisiPriorityQueue(queryToIndex.size());

To:

-DisiPriorityQueue subScorersPQ = new DisiPriorityQueue(queryToIndex.size());
+DisiPriorityQueue subScorersPQ = new DisiPriorityQueue(subScorers.size());

The query I provided in #497 (comment) no longer errors and the results look roughly as I'd expect.

@navneet1v
Collaborator

Thanks for the quick investigation! To clarify, am I supposed to avoid combining match and neural with the same query? As in, this isn't okay? (from the docs):

"query": {
"hybrid": {
"queries": [
{
"match": {
"text": {
"query": "Hi world"
}
}
},
{
"neural": {
"passage_embedding": {
"query_text": "Hi world",
"model_id": "aVeif4oB5Vm0Tdw8zYO2",
"k": 5
}
}
}
]
}
}
Or is it just because of my use of both match_phrase and match?

This is okay.

But in your case:

{
  "sort": [
    {
      "_score": {
        "order": "desc"
      }
    }
  ],
  "_source": {
    "exclude": [
      "text_embedding"
    ]
  },
  "query": {
    "hybrid": {
      "queries": [
        {
          "match_phrase": {
            "text": {
              "query": "foo"
            }
          }
        },
        {
          "match": {
            "text": {
              "query": "foo"
            }
          }
        },
        {
          "neural": {
            "text_embedding": {
              "query_text": "foo",
              "model_id": "--------",
              "k": 5
            }
          }
        }
      ]
    }
  }
}

The match_phrase and match queries are actually boiling down to the same query, and hence the issue was happening.
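
For a single analyzed token like "foo", both builders reduce to the same Lucene term query. This is my own illustration with plain Lucene types, assuming the usual single-token rewrite, not the plugin's code:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;

// Why match and match_phrase collide for a one-word query: with a single
// analyzed token, both reduce to the same Lucene TermQuery, and equal Query
// objects share one key in the query-to-index map.
public class SingleTermCollision {
    public static void main(String[] args) {
        TermQuery fromMatch = new TermQuery(new Term("text", "foo"));
        TermQuery fromMatchPhrase = new TermQuery(new Term("text", "foo")); // a one-term phrase becomes a term query
        System.out.println(fromMatch.equals(fromMatchPhrase)); // true
    }
}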

@Lemmmy

Lemmmy commented Dec 19, 2023

Ah, that makes a lot more sense, I will fix that then. Thanks for all your help.

@navneet1v
Collaborator

navneet1v commented Dec 19, 2023

Ah, that makes a lot more sense, I will fix that then. Thanks for all your help.

Sure, I am planning to add a check: if we find that the queries are the same, we throw an exception from here: https://github.com/opensearch-project/neural-search/blob/main/src/main/java/org/opensearch/neuralsearch/query/HybridQueryBuilder.java#L297

like this:

if (queries.size() != new HashSet<>(queries).size()) {
    throw new OpenSearchException("There are duplicates in the query.");
}

This will ensure that such queries are not run at all, because making this change instead:

-DisiPriorityQueue subScorersPQ = new DisiPriorityQueue(queryToIndex.size());
+DisiPriorityQueue subScorersPQ = new DisiPriorityQueue(subScorers.size());

has some other side effects in the code.

@tiagoshin
Author

tiagoshin commented Dec 19, 2023

Hi @navneet1v, thank you very much for your attention.
I'm testing the 2.12.0 RC build and now I'm getting different errors.
For all queries, when I perform hybrid search, I get:

    "error": {
        "root_cause": [
            {
                "type": "illegal_argument_exception",
                "reason": "totalHitsThreshold must be less than max integer value"
            }
        ],
        "type": "search_phase_execution_exception",
        "reason": "all shards failed",
        "phase": "query",
        "grouped": true,
        "failed_shards": [
            {
                "shard": 0,
                "index": "test",
                "node": "si1uOQWhRMWsWbFC6kaKjg",
                "reason": {
                    "type": "illegal_argument_exception",
                    "reason": "totalHitsThreshold must be less than max integer value"
                }
            }
        ],
        "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "totalHitsThreshold must be less than max integer value",
            "caused_by": {
                "type": "illegal_argument_exception",
                "reason": "totalHitsThreshold must be less than max integer value"
            }
        }
    },
    "status": 400
}

So I increased track_total_hits to 50,000 and it worked for some queries. For other queries, I got the following error:

    "error": {
        "root_cause": [],
        "type": "search_phase_execution_exception",
        "reason": "The phase has failed",
        "phase": "query",
        "grouped": true,
        "failed_shards": [],
        "caused_by": {
            "type": "illegal_state_exception",
            "reason": "Score normalization processor cannot produce final query result"
        }
    },
    "status": 500
}

Here are the logs:

2023-12-19 18:25:36 opensearch_semantic1  | org.opensearch.action.search.SearchPhaseExecutionException: The phase has failed
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:718) [opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.action.search.AbstractSearchAsyncAction.successfulShardExecution(AbstractSearchAsyncAction.java:622) [opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.action.search.AbstractSearchAsyncAction.onShardResultConsumed(AbstractSearchAsyncAction.java:607) [opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.action.search.AbstractSearchAsyncAction.lambda$onShardResult$9(AbstractSearchAsyncAction.java:590) [opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.action.search.QueryPhaseResultConsumer$PendingMerges.consume(QueryPhaseResultConsumer.java:373) [opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.action.search.QueryPhaseResultConsumer.consumeResult(QueryPhaseResultConsumer.java:132) [opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.action.search.AbstractSearchAsyncAction.onShardResult(AbstractSearchAsyncAction.java:590) [opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.action.search.SearchQueryThenFetchAsyncAction.onShardResult(SearchQueryThenFetchAsyncAction.java:161) [opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.action.search.AbstractSearchAsyncAction$1.innerOnResponse(AbstractSearchAsyncAction.java:292) [opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.action.search.SearchActionListener.onResponse(SearchActionListener.java:59) [opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.action.search.SearchActionListener.onResponse(SearchActionListener.java:44) [opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.action.search.SearchExecutionStatsCollector.onResponse(SearchExecutionStatsCollector.java:99) [opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.action.search.SearchExecutionStatsCollector.onResponse(SearchExecutionStatsCollector.java:52) [opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:70) [opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.action.search.SearchTransportService$ConnectionCountingHandler.handleResponse(SearchTransportService.java:746) [opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.transport.TransportService$9.handleResponse(TransportService.java:1693) [opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1475) [opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.transport.TransportService$DirectResponseChannel.processResponse(TransportService.java:1558) [opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1538) [opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:72) [opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:62) [opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:45) [opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:74) [opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89) [opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.threadpool.TaskAwareRunnable.doRun(TaskAwareRunnable.java:78) [opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:59) [opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:911) [opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
2023-12-19 18:25:36 opensearch_semantic1  |     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
2023-12-19 18:25:36 opensearch_semantic1  |     at java.base/java.lang.Thread.run(Thread.java:829) [?:?]
2023-12-19 18:25:36 opensearch_semantic1  | Caused by: org.opensearch.search.pipeline.SearchPipelineProcessingException: java.lang.IllegalStateException: Score normalization processor cannot produce final query result
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.search.pipeline.Pipeline.runSearchPhaseResultsTransformer(Pipeline.java:295) ~[opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.search.pipeline.PipelinedRequest.transformSearchPhaseResults(PipelinedRequest.java:47) ~[opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:755) ~[opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.action.search.AbstractSearchAsyncAction.successfulShardExecution(AbstractSearchAsyncAction.java:620) [opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     ... 31 more
2023-12-19 18:25:36 opensearch_semantic1  | Caused by: java.lang.IllegalStateException: Score normalization processor cannot produce final query result
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.neuralsearch.processor.NormalizationProcessorWorkflow.getSearchHits(NormalizationProcessorWorkflow.java:177) ~[?:?]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.neuralsearch.processor.NormalizationProcessorWorkflow.updateOriginalFetchResults(NormalizationProcessorWorkflow.java:142) ~[?:?]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.neuralsearch.processor.NormalizationProcessorWorkflow.execute(NormalizationProcessorWorkflow.java:73) ~[?:?]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.neuralsearch.processor.NormalizationProcessor.process(NormalizationProcessor.java:62) ~[?:?]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.search.pipeline.SearchPhaseResultsProcessor.process(SearchPhaseResultsProcessor.java:48) ~[opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.search.pipeline.Pipeline.runSearchPhaseResultsTransformer(Pipeline.java:276) ~[opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.search.pipeline.PipelinedRequest.transformSearchPhaseResults(PipelinedRequest.java:47) ~[opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:755) ~[opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     at org.opensearch.action.search.AbstractSearchAsyncAction.successfulShardExecution(AbstractSearchAsyncAction.java:620) ~[opensearch-2.12.0.jar:2.12.0]
2023-12-19 18:25:36 opensearch_semantic1  |     ... 31 more

@navneet1v
Collaborator

@tiagoshin can you share the query skeleton with me so that I can reproduce the issue? BTW, are you setting track_total_hits in the query?

@tiagoshin
Author

@navneet1v I shared the query and artifacts with David Fowler. Could you please get them from him?

@tiagoshin
Author

@navneet1v I got the same IndexOutOfBoundsException issue that I reported before, on version 2.12.0, when increasing the ef_construction parameter to 1024. Before that, the exact same query with the same data and model was working. Once I increased the ef_construction parameter, I got the following error:

{
    "error": {
        "root_cause": [
            {
                "type": "index_out_of_bounds_exception",
                "reason": null
            }
        ],
        "type": "search_phase_execution_exception",
        "reason": "all shards failed",
        "phase": "query",
        "grouped": true,
        "failed_shards": [
            {
                "shard": 0,
                "index": "pluto-test2",
                "node": "j4rUlY77ToenCAVXWKUnxA",
                "reason": {
                    "type": "index_out_of_bounds_exception",
                    "reason": null
                }
            }
        ],
        "caused_by": {
            "type": "index_out_of_bounds_exception",
            "reason": null,
            "caused_by": {
                "type": "index_out_of_bounds_exception",
                "reason": null
            }
        }
    },
    "status": 500
}

In the logs I see:

2023-12-21 14:39:34 org.opensearch.action.search.SearchPhaseExecutionException: all shards failed
2023-12-21 14:39:34     at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:718) [opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:379) [opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:757) [opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:511) [opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:301) [opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.action.search.SearchExecutionStatsCollector.onFailure(SearchExecutionStatsCollector.java:104) [opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:75) [opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:755) [opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.transport.TransportService$9.handleException(TransportService.java:1699) [opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1485) [opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1599) [opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1573) [opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:81) [opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.transport.TransportChannel.sendErrorResponse(TransportChannel.java:73) [opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.action.support.ChannelActionListener.onFailure(ChannelActionListener.java:70) [opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:104) [opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) [opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.threadpool.TaskAwareRunnable.doRun(TaskAwareRunnable.java:78) [opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:59) [opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:911) [opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
2023-12-21 14:39:34     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
2023-12-21 14:39:34     at java.base/java.lang.Thread.run(Thread.java:829) [?:?]
2023-12-21 14:39:34 Caused by: org.opensearch.OpenSearchException$3
2023-12-21 14:39:34     at org.opensearch.OpenSearchException.guessRootCauses(OpenSearchException.java:708) ~[opensearch-core-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:377) [opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     ... 23 more
2023-12-21 14:39:34 Caused by: java.lang.IndexOutOfBoundsException
2023-12-21 14:39:34     at java.base/java.nio.Buffer.checkIndex(Buffer.java:687) ~[?:?]
2023-12-21 14:39:34     at java.base/java.nio.DirectByteBuffer.get(DirectByteBuffer.java:269) ~[?:?]
2023-12-21 14:39:34     at org.apache.lucene.store.ByteBufferGuard.getByte(ByteBufferGuard.java:115) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.apache.lucene.store.ByteBufferIndexInput$SingleBufferImpl.readByte(ByteBufferIndexInput.java:564) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.apache.lucene.codecs.lucene90.Lucene90NormsProducer$8.longValue(Lucene90NormsProducer.java:443) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.apache.lucene.search.LeafSimScorer.getNormValue(LeafSimScorer.java:47) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.apache.lucene.search.LeafSimScorer.score(LeafSimScorer.java:60) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.apache.lucene.search.TermScorer.score(TermScorer.java:86) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.apache.lucene.search.DisjunctionSumScorer.score(DisjunctionSumScorer.java:41) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.apache.lucene.search.DisjunctionScorer.score(DisjunctionScorer.java:178) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.apache.lucene.search.ConjunctionScorer.score(ConjunctionScorer.java:61) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.apache.lucene.search.DisjunctionMaxScorer.score(DisjunctionMaxScorer.java:65) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.apache.lucene.search.DisjunctionScorer.score(DisjunctionScorer.java:178) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.apache.lucene.search.ConjunctionScorer.score(ConjunctionScorer.java:61) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.apache.lucene.search.DisjunctionMaxScorer.score(DisjunctionMaxScorer.java:65) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.apache.lucene.search.DisjunctionScorer.score(DisjunctionScorer.java:178) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.apache.lucene.search.ReqOptSumScorer.score(ReqOptSumScorer.java:266) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.opensearch.neuralsearch.query.HybridQueryScorer.hybridScores(HybridQueryScorer.java:117) ~[?:?]
2023-12-21 14:39:34     at org.opensearch.neuralsearch.search.HybridTopScoreDocCollector$1.collect(HybridTopScoreDocCollector.java:65) ~[?:?]
2023-12-21 14:39:34     at org.apache.lucene.search.Weight$DefaultBulkScorer.scoreRange(Weight.java:277) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:236) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.opensearch.search.internal.CancellableBulkScorer.score(CancellableBulkScorer.java:71) ~[opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:38) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.opensearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:326) ~[opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:282) ~[opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:549) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.opensearch.neuralsearch.search.query.HybridQueryPhaseSearcher.searchWithCollector(HybridQueryPhaseSearcher.java:219) ~[?:?]
2023-12-21 14:39:34     at org.opensearch.neuralsearch.search.query.HybridQueryPhaseSearcher.searchWith(HybridQueryPhaseSearcher.java:72) ~[?:?]
2023-12-21 14:39:34     at org.opensearch.search.query.QueryPhase.executeInternal(QueryPhase.java:282) ~[opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.search.query.QueryPhase.execute(QueryPhase.java:155) ~[opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:547) ~[opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.search.SearchService.executeQueryPhase(SearchService.java:611) ~[opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.search.SearchService$2.lambda$onResponse$0(SearchService.java:580) ~[opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:74) ~[opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89) ~[opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     ... 8 more

However, if I decrease ef_construction, the queries that were failing with the error reported here keep failing. So decreasing ef_construction doesn't fix the other failures, but increasing it can trigger this error.

@navneet1v
Collaborator

@navneet1v I got the same issue that I reported before about the IndexOutOfBoundsException on version 2.12.0 when increasing the ef_construction parameter to 1024. [...]
2023-12-21 14:39:34     at org.apache.lucene.store.ByteBufferGuard.getByte(ByteBufferGuard.java:115) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.apache.lucene.store.ByteBufferIndexInput$SingleBufferImpl.readByte(ByteBufferIndexInput.java:564) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.apache.lucene.codecs.lucene90.Lucene90NormsProducer$8.longValue(Lucene90NormsProducer.java:443) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.apache.lucene.search.LeafSimScorer.getNormValue(LeafSimScorer.java:47) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.apache.lucene.search.LeafSimScorer.score(LeafSimScorer.java:60) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.apache.lucene.search.TermScorer.score(TermScorer.java:86) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.apache.lucene.search.DisjunctionSumScorer.score(DisjunctionSumScorer.java:41) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.apache.lucene.search.DisjunctionScorer.score(DisjunctionScorer.java:178) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.apache.lucene.search.ConjunctionScorer.score(ConjunctionScorer.java:61) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.apache.lucene.search.DisjunctionMaxScorer.score(DisjunctionMaxScorer.java:65) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.apache.lucene.search.DisjunctionScorer.score(DisjunctionScorer.java:178) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.apache.lucene.search.ConjunctionScorer.score(ConjunctionScorer.java:61) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.apache.lucene.search.DisjunctionMaxScorer.score(DisjunctionMaxScorer.java:65) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.apache.lucene.search.DisjunctionScorer.score(DisjunctionScorer.java:178) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.apache.lucene.search.ReqOptSumScorer.score(ReqOptSumScorer.java:266) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.opensearch.neuralsearch.query.HybridQueryScorer.hybridScores(HybridQueryScorer.java:117) ~[?:?]
2023-12-21 14:39:34     at org.opensearch.neuralsearch.search.HybridTopScoreDocCollector$1.collect(HybridTopScoreDocCollector.java:65) ~[?:?]
2023-12-21 14:39:34     at org.apache.lucene.search.Weight$DefaultBulkScorer.scoreRange(Weight.java:277) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:236) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.opensearch.search.internal.CancellableBulkScorer.score(CancellableBulkScorer.java:71) ~[opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:38) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.opensearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:326) ~[opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:282) ~[opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:549) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
2023-12-21 14:39:34     at org.opensearch.neuralsearch.search.query.HybridQueryPhaseSearcher.searchWithCollector(HybridQueryPhaseSearcher.java:219) ~[?:?]
2023-12-21 14:39:34     at org.opensearch.neuralsearch.search.query.HybridQueryPhaseSearcher.searchWith(HybridQueryPhaseSearcher.java:72) ~[?:?]
2023-12-21 14:39:34     at org.opensearch.search.query.QueryPhase.executeInternal(QueryPhase.java:282) ~[opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.search.query.QueryPhase.execute(QueryPhase.java:155) ~[opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:547) ~[opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.search.SearchService.executeQueryPhase(SearchService.java:611) ~[opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.search.SearchService$2.lambda$onResponse$0(SearchService.java:580) ~[opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:74) ~[opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89) ~[opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.12.0.jar:2.12.0]
2023-12-21 14:39:34     ... 8 more

However, if I decrease ef_construction, the queries that were hitting the error reported here still fail with the same error. So decreasing ef_construction doesn't fix the other issue, but increasing it can trigger this one.

The IndexOutOfBoundsException fix is not in 2.12; 2.12 contains only the fix for nested queries. If you look at my RCA here: #497 (comment), it shows that if you have 2 sub-queries that are identical, the issue will happen. So check your array of hybrid sub-queries for duplicates; if there are any, removing them is a short-term fix on your side while we decide how to handle duplicate queries.
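For illustration, the duplicate case looks like the sketch below, where the hybrid query's queries array repeats the same sub-query (the field name and query text are hypothetical). Until the fix lands, dropping one of the identical entries avoids the exception:

{
    "query": {
        "hybrid": {
            "queries": [
                { "match": { "title": "wireless headphones" } },
                { "match": { "title": "wireless headphones" } }
            ]
        }
    }
}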

@tiagoshin
Copy link
Author

@navneet1v I saw your comment about having 2 queries that are the same, but it's not the case. I have only 2 queries, one for neural search and the other for lexical search. One of them uses a model, and the other uses a bunch of matching rules.
Also, I'm reporting here that I got the IndexOutOfBoundsException just by changing the ef_construction parameter when recreating the index. I didn't change the query, so the same query that was working before failed with a distinct ef_construction parameter. That's why I don't think duplicate queries are the issue for me.

@navneet1v
Copy link
Collaborator

navneet1v commented Dec 21, 2023

@navneet1v I saw your comment about having 2 queries that are the same, but it's not the case. I have only 2 queries, one for neural search and the other for lexical search. One of them uses a model, and the other uses a bunch of matching rules. Also, I'm reporting here that I got the IndexOutOfBoundsException just by changing the ef_construction parameter when recreating the index. I didn't change the query, so the same query that was working before failed with a distinct ef_construction parameter. That's why I don't think duplicate queries are the issue for me.

Thanks, I will check this new error trace. BTW, it's pretty counter-intuitive that ef_construction is creating a problem, and at query time no less; this parameter is only used at index build time.

I suspect a high ef_construction value could only cause this if the overall memory of the system is under stress. To better understand this, can you share a few more details:

  1. JVM heap size of OpenSearch
  2. Which k-NN engine you are using: nmslib, faiss, or lucene
  3. RAM of the machine/Docker container where you are running the OpenSearch process (see the sketch below for how to pull these details from the cluster)
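A minimal sketch using standard APIs, with a hypothetical index name; the k-NN engine shows up under method.engine in the mapping of the knn_vector field:

GET _cat/nodes?v&h=name,heap.max,ram.max
GET my-index/_mapping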

@tiagoshin

@tiagoshin
Copy link
Author

@navneet1v you're right, this particular issue is caused by system memory constraints. I increased the JVM heap size and it worked, thank you!
However, it's worth noting that the other issue keeps happening.
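For anyone else hitting this: the heap is typically raised in config/jvm.options, or via the OPENSEARCH_JAVA_OPTS environment variable when running in Docker. A minimal sketch, assuming the host has enough RAM to back a 4 GB heap:

# config/jvm.options
-Xms4g
-Xmx4g

# or, when running in Docker:
# OPENSEARCH_JAVA_OPTS="-Xms4g -Xmx4g"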

@navneet1v
Copy link
Collaborator

@navneet1v you're right, this particular issue is caused by system memory constraints. I increased the JVM heap size and it worked, thank you! However, it's worth noting that the other issue keeps happening.

Thanks for the response. I am working on that issue, doing some more validation before I post a root cause and a fix.

@navneet1v
Copy link
Collaborator

navneet1v commented Dec 22, 2023

So I was able to get to the root cause of the issue mentioned here (#497 (comment)):

Hi @navneet1v, thank you very much for your attention.
I'm testing the 2.12.0 RC build and now I'm getting different errors.
For all queries, when I perform a hybrid search, I get:

{
"error": {
    "root_cause": [
        {
            "type": "illegal_argument_exception",
            "reason": "totalHitsThreshold must be less than max integer value"
        }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
        {
            "shard": 0,
            "index": "test",
            "node": "si1uOQWhRMWsWbFC6kaKjg",
            "reason": {
                "type": "illegal_argument_exception",
                "reason": "totalHitsThreshold must be less than max integer value"
            }
        }
    ],
    "caused_by": {
        "type": "illegal_argument_exception",
        "reason": "totalHitsThreshold must be less than max integer value",
        "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "totalHitsThreshold must be less than max integer value"
        }
    }
},
"status": 400
}

So the first issue, where we are seeing "totalHitsThreshold must be less than max integer value", is coming from this check:

if (totalHitsThreshold == Integer.MAX_VALUE) {
    throw new IllegalArgumentException(String.format(Locale.ROOT, "totalHitsThreshold must be less than max integer value"));
}

This case happens when track_total_hits: true is set in the search request rather than an integer value: with track_total_hits: true, the total hits threshold becomes Integer.MAX_VALUE, and hence the check fails. I checked that track_total_hits: true works with other query clauses, so I think we can remove the check. I will go ahead and fix this.
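For illustration, a request shape that reaches this code path might look like the sketch below; the index, field, pipeline name, and model id are hypothetical:

POST /my-index/_search?search_pipeline=my-hybrid-pipeline
{
    "track_total_hits": true,
    "query": {
        "hybrid": {
            "queries": [
                { "match": { "title": "wireless headphones" } },
                {
                    "neural": {
                        "title_embedding": {
                            "query_text": "wireless headphones",
                            "model_id": "<model-id>",
                            "k": 100
                        }
                    }
                }
            ]
        }
    }
}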

For the second issue, where track_total_hits was 50000, @tiagoshin can you provide me this info:

  1. How many shards were you using?
  2. How many data nodes are you using?
  3. How many total documents were there in the index?

@tiagoshin
Copy link
Author

@navneet1v

  1. I'm using two shards:
index                   shard prirep state    docs   store ip         node
.plugins-ml-model-group 0     p      STARTED     1  12.5kb 172.18.0.3 node-1
.plugins-ml-model-group 0     r      STARTED     1   5.5kb 172.18.0.2 node-2
.plugins-ml-config      0     p      STARTED     1   3.9kb 172.18.0.3 node-1
.plugins-ml-config      0     r      STARTED     1   3.9kb 172.18.0.2 node-2
.plugins-ml-model       0     p      STARTED    11 115.8mb 172.18.0.3 node-1
.plugins-ml-model       0     r      STARTED    11 115.9mb 172.18.0.2 node-2
.plugins-ml-task        0     p      STARTED     2  44.4kb 172.18.0.3 node-1
.plugins-ml-task        0     r      STARTED     2  36.8kb 172.18.0.2 node-2
test                    0     p      STARTED 75997   190mb 172.18.0.3 node-1
test                    0     r      STARTED 75997 191.9mb 172.18.0.2 node-2
  2. I'm using 2 data nodes
  3. I have 75997 documents:
health status index                   uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   .plugins-ml-model-group 8radM4MFTvSD0ml76PrneA   1   1          1            0       18kb         12.5kb
green  open   .plugins-ml-config      4KiNinC1QTCmEwmOwK-omw   1   1          1            0      7.8kb          3.9kb
green  open   .plugins-ml-model       RYKMfd3KTj2OfiK4madWyw   1   1         11            0    231.8mb        115.8mb
green  open   .plugins-ml-task        E_rRWs4vSuulZ6U6n2FY9g   1   1          2            0     81.2kb         44.4kb
green  open   test                    GSRltbzPQVeJ5h7MoxYSdg   1   1      75997        21976      382mb          190mb

@navneet1v
Copy link
Collaborator

@navneet1v

  1. I'm using two shards:

index                   shard prirep state    docs   store ip         node
.plugins-ml-model-group 0     p      STARTED     1  12.5kb 172.18.0.3 node-1
.plugins-ml-model-group 0     r      STARTED     1   5.5kb 172.18.0.2 node-2
.plugins-ml-config      0     p      STARTED     1   3.9kb 172.18.0.3 node-1
.plugins-ml-config      0     r      STARTED     1   3.9kb 172.18.0.2 node-2
.plugins-ml-model       0     p      STARTED    11 115.8mb 172.18.0.3 node-1
.plugins-ml-model       0     r      STARTED    11 115.9mb 172.18.0.2 node-2
.plugins-ml-task        0     p      STARTED     2  44.4kb 172.18.0.3 node-1
.plugins-ml-task        0     r      STARTED     2  36.8kb 172.18.0.2 node-2
test                    0     p      STARTED 75997   190mb 172.18.0.3 node-1
test                    0     r      STARTED 75997 191.9mb 172.18.0.2 node-2
  2. I'm using 2 data nodes
  3. I have 75997 documents:
health status index                   uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   .plugins-ml-model-group 8radM4MFTvSD0ml76PrneA   1   1          1            0       18kb         12.5kb
green  open   .plugins-ml-config      4KiNinC1QTCmEwmOwK-omw   1   1          1            0      7.8kb          3.9kb
green  open   .plugins-ml-model       RYKMfd3KTj2OfiK4madWyw   1   1         11            0    231.8mb        115.8mb
green  open   .plugins-ml-task        E_rRWs4vSuulZ6U6n2FY9g   1   1          2            0     81.2kb         44.4kb
green  open   test                    GSRltbzPQVeJ5h7MoxYSdg   1   1      75997        21976      382mb          190mb

Actually, you are using 1 shard; the other is a replica of the first. But thanks for this information. The code path that produces the issue you hit with track_total_hits: 50000 can only be reached when there is 1 primary shard.

Just to work around the issue for now, can you try with more than 1 primary shard and see if you still face the issue with track_total_hits: 50000? I am hoping you won't.
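For illustration, the workaround amounts to recreating the index with more than one primary shard; a sketch with a hypothetical index name (all other settings elided):

PUT /my-index
{
    "settings": {
        "index": {
            "number_of_shards": 2,
            "number_of_replicas": 1,
            "knn": true
        }
    }
}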

@tiagoshin
Copy link
Author

@navneet1v It worked when increasing shards to 2, thank you very much!
What's your advice about this if I increase the number of replicas as well?

@navneet1v
Copy link
Collaborator

@navneet1v It worked when increasing shards to 2, thank you very much! What's your advice about this if I increase the number of replicas as well?

Replicas will have no impact; you can set them to whatever you want.
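For completeness, the replica count can be changed on a live index at any time; a sketch with a hypothetical index name:

PUT /my-index/_settings
{
    "index": { "number_of_replicas": 2 }
}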

Just to say it one more time: I am still going to do a deep dive to fix the issue with 1 shard too. But for now, happy to know you are unblocked.

martin-gaievski added a commit that referenced this issue Dec 29, 2023
* Allow multiple identical sub-queries in hybrid query, removed validation for total hits

Signed-off-by: Martin Gaievski <[email protected]>
martin-gaievski added a commit to martin-gaievski/neural-search that referenced this issue Jan 2, 2024
…-project#524)

* Allow multiple identical sub-queries in hybrid query, removed validation for total hits

Signed-off-by: Martin Gaievski <[email protected]>
(cherry picked from commit 585fbbe)
martin-gaievski added a commit to martin-gaievski/neural-search that referenced this issue Jan 2, 2024
…-project#524)

* Allow multiple identical sub-queries in hybrid query, removed validation for total hits

Signed-off-by: Martin Gaievski <[email protected]>
(cherry picked from commit 585fbbe)
Signed-off-by: Martin Gaievski <[email protected]>
martin-gaievski added a commit that referenced this issue Jan 2, 2024
* Allow multiple identical sub-queries in hybrid query, removed validation for total hits


(cherry picked from commit 585fbbe)

Signed-off-by: Martin Gaievski <[email protected]>
@tiagoshin
Copy link
Author

Hi @navneet1v!
David Fowler provided me a patch to test the version with the fix on the AWS OpenSearch cluster.
During tests, I verified that I keep having issues even when I use 2 shards and track_total_hits: 50000.
I identified 3 issues:

  • For some queries, I'm getting an index_out_of_bounds_exception error on both shards;
  • For some other queries, when I do get successful responses, I identified that:
    • The results are not reproducible; they change when I run the query again. For comparison, query A or query B run in isolation against the same index gives me reproducible results, which makes me believe the issue is related to hybrid search;
    • The results aren't a combination of the query A and query B results according to the hybrid search definition. For instance, some documents that don't appear in the results of query A or query B appear as the first results of the hybrid search;

I shared the artifacts for reproducing the issue with David Fowler.

@martin-gaievski
Copy link
Member

Hi @tiagoshin
For the queries where you're getting index_out_of_bounds_exception, is there a server log available, similar to what you provided in the initial issue description? If yes, can you please share it? This exception is pretty generic, and it's hard to tell just from its name where it's coming from.

For issue 2, I think what's happening is that some results have exactly the same score after normalization, and when combined, some of them may be pushed out of the final result list. As this depends on the order of execution of the individual sub-queries, the result of such re-arrangement will look different on every run.

For issue 3, what can happen is that a doc receives a higher combined score if it appears in the results of, say, 2 sub-queries rather than only one, even if the doc ranks high in that one result list. For example, say we have sub-queries A and B, and each returns the following results: A = [doc1: 0.7, doc2: 0.6, doc3: 0.5] and B = [doc4: 0.7, doc5: 0.6, doc3: 0.5]. If we use the arithmetic mean for combination (a doc missing from a sub-query's results contributes a score of 0), the final result will look like: [doc3: 0.5, doc1: 0.35, doc4: 0.35].

@tiagoshin
Copy link
Author

@martin-gaievski Here are the logs:
[2024-01-24T22:47:43,188][WARN ][r.suppressed ] [2171808a1c467ff0235c64b24d841aac] path: __PATH__ params: {index=index}Failed to execute phase [query], all shards failed; shardFailures {[Vq9bCKylQyiRxmS78Azwtg][index][0]: RemoteTransportException[[c8a686a3b4ea2af6011c66cfb2dffbcd][__IP__][__PATH__[__PATH__]]]; nested: QueryPhaseExecutionException[Query Failed [Failed to execute main query]]; nested: NotSerializableExceptionWrapper[index_out_of_bounds_exception: null]; }{[7HN8cHU6TPq_jl3l1cwWTA][index][1]: RemoteTransportException[[8d30fa4f7db7a4545505fc4ebc06002c][__IP__][__PATH__[__PATH__]]]; nested: QueryPhaseExecutionException[Query Failed [Failed to execute main query]]; nested: NotSerializableExceptionWrapper[index_out_of_bounds_exception: 634]; } at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:716) at org.opensearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:381) at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:755) at org.opensearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:513) at org.opensearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:303) at org.opensearch.action.search.SearchExecutionStatsCollector.onFailure(SearchExecutionStatsCollector.java:104) at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:75) at org.opensearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:753) at org.opensearch.transport.TransportService$6.handleException(TransportService.java:903) at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1526) at org.opensearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:438) at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:412) at org.opensearch.transport.InboundHandler.handleException(InboundHandler.java:436) at org.opensearch.transport.InboundHandler.handlerResponseError(InboundHandler.java:428) at org.opensearch.transport.InboundHandler.messageReceived(InboundHandler.java:166) at org.opensearch.transport.InboundHandler.inboundMessage(InboundHandler.java:123) at org.opensearch.transport.TcpTransport.inboundMessage(TcpTransport.java:770) at org.opensearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:175) at org.opensearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:150) at org.opensearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:115) at org.opensearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:95) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1471) at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1334) at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1383) at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:529) at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:468) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:689) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:652) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) at __PATH__(Thread.java:833)Caused by: NotSerializableExceptionWrapper[index_out_of_bounds_exception: null] at java.nio.Buffer.checkIndex(Buffer.java:743) at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:332) at org.apache.lucene.store.ByteBufferGuard.getByte(ByteBufferGuard.java:115) at org.apache.lucene.store.ByteBufferIndexInput$SingleBufferImpl.readByte(ByteBufferIndexInput.java:564) at org.apache.lucene.codecs.lucene90.Lucene90NormsProducer$8.longValue(Lucene90NormsProducer.java:443) at org.apache.lucene.search.LeafSimScorer.getNormValue(LeafSimScorer.java:47) at org.apache.lucene.search.LeafSimScorer.score(LeafSimScorer.java:60) at org.apache.lucene.search.TermScorer.score(TermScorer.java:75) at org.apache.lucene.search.WANDScorer.score(WANDScorer.java:527) at org.apache.lucene.search.DisjunctionMaxScorer.score(DisjunctionMaxScorer.java:65) at org.apache.lucene.search.DisjunctionScorer.score(DisjunctionScorer.java:193) at org.apache.lucene.search.ConjunctionScorer.score(ConjunctionScorer.java:61) at org.apache.lucene.search.DisjunctionMaxScorer.score(DisjunctionMaxScorer.java:65) at org.apache.lucene.search.DisjunctionScorer.score(DisjunctionScorer.java:193) at org.apache.lucene.search.ReqOptSumScorer.score(ReqOptSumScorer.java:273) at org.opensearch.neuralsearch.query.HybridQueryScorer.hybridScores(HybridQueryScorer.java:137) at org.opensearch.neuralsearch.search.HybridTopScoreDocCollector$1.collect(HybridTopScoreDocCollector.java:65) at org.apache.lucene.search.Weight$DefaultBulkScorer.scoreRange(Weight.java:274) at org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:254) at 
org.opensearch.search.internal.CancellableBulkScorer.score(CancellableBulkScorer.java:71) at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:38) at org.opensearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:322) at org.opensearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:281) at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:551) at org.opensearch.neuralsearch.search.query.HybridQueryPhaseSearcher.searchWithCollector(HybridQueryPhaseSearcher.java:219) at org.opensearch.neuralsearch.search.query.HybridQueryPhaseSearcher.searchWith(HybridQueryPhaseSearcher.java:72) at org.opensearch.search.query.QueryPhase.executeInternal(QueryPhase.java:282) at org.opensearch.search.query.QueryPhase.execute(QueryPhase.java:155) at org.opensearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:533) at org.opensearch.search.SearchService.executeQueryPhase(SearchService.java:597) at org.opensearch.search.SearchService$2.lambda$onResponse$0(SearchService.java:566) at org.opensearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:74) at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89) at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) at org.opensearch.threadpool.TaskAwareRunnable.doRun(TaskAwareRunnable.java:78) at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) at org.opensearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:59) at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:917) at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) at java.lang.Thread.run(Thread.java:833)

@tiagoshin
Copy link
Author

tiagoshin commented Jan 26, 2024

@martin-gaievski
I understand your reasoning regarding topics 2 and 3, but I can assure you that's not the case.
Let me give some examples:
Setup:
I'm using the l2 normalizer and the arithmetic_mean combiner, with weights 0.5 and 0.5. The results for query A and query B are always the same, so I'll focus on a few results from the hybrid query and report where those documents ranked in queries A and B.
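For context, this setup corresponds to a normalization search pipeline along these lines; the pipeline name is hypothetical, while the techniques and weights match what I described above:

PUT /_search/pipeline/hybrid-pipeline
{
    "phase_results_processors": [
        {
            "normalization-processor": {
                "normalization": { "technique": "l2" },
                "combination": {
                    "technique": "arithmetic_mean",
                    "parameters": { "weights": [0.5, 0.5] }
                }
            }
        }
    ]
}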

1st hybrid search run
Looking for document X

  • Query A: Score: 53.25505. Position: 13th with the same score as 12th
  • Query B: Score: ?. Position: None. Does not appear in the results
  • Hybrid search query: Score: 0.28119355. Position: 1st

2nd hybrid search run
Looking for document Y

  • Query A: Score: 41.551655. Position: 19th
  • Query B: Score: ?. Position: None. Does not appear in the results
  • Hybrid search query: Score: 0.16391976. Position: 3rd

3rd hybrid search run
Looking for document Z

  • Query A: Score: ?. Position: None. Does not appear in the results
  • Query B: Score: ?. Position: None. Does not appear in the results
  • Hybrid search query: 0.22679898. Position: 1st

In conclusion:

  • The hybrid results vary considerably between runs;
  • In most cases the calculations aren't correct; it's not hard to find many documents that should rank higher than the result that appears in 1st place in the hybrid search.

I'll send you all the queries and results privately, so you can check yourself.

@ryanbogan ryanbogan added v2.13.0 and removed v2.12.0 Issues targeting release v2.12.0 labels Feb 22, 2024
marcus-bcl added a commit to ministryofjustice/probation-offender-search that referenced this issue Jul 29, 2024
Workaround for a hybrid query bug in OpenSearch - opensearch-project/neural-search#497
@navneet1v
Copy link
Collaborator

@martin-gaievski can we close this issue, as the bug is resolved?

@martin-gaievski
Copy link
Member

Yes, code-wise we took care of the problem in #524.
