
[BUG] NullPointerException in delete_by_query when index is deleted #8418

Open
blampe opened this issue Jul 3, 2023 · 7 comments

Comments

@blampe

blampe commented Jul 3, 2023

Describe the bug

delete_by_query with a wildcard pattern can return a 500 if an underlying index is deleted while the query is running.

I've seen two stack traces which seem to reflect the different phases of the delete_by_query that can encounter the race condition:

java.lang.NullPointerException: Cannot invoke "org.opensearch.cluster.metadata.IndexMetadata.getSettings()" because the return value of "org.opensearch.cluster.metadata.Metadata.index(String)" is null
    at org.opensearch.indexmanagement.rollup.util.RollupUtilsKt.isRollupIndex(RollupUtils.kt:70)
    at org.opensearch.indexmanagement.rollup.interceptor.RollupInterceptor$interceptHandler$1.messageReceived(RollupInterceptor.kt:81)
    at org.opensearch.performanceanalyzer.transport.PerformanceAnalyzerTransportRequestH...+3903 more"
Failed to execute phase [can_match], Partial shards failure; ... RemoteTransportException[[36a7dbd9822c64d770ce4c655e56ec40][__IP__][__PATH__[can_match]]]; nested: NullPointerException; }
    at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:674)
    at org.opensearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:400)
    at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:709)
    at org.opensearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:482)
    at org.opensearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:297)
    at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:74)
    ...

To Reproduce

This has been reproducible in our production setup which consists of a steady stream of delete_by_query operations and occasional alias updates (concurrent with the deletes). The fact that there's an alias involved may or may not be relevant -- I haven't attempted to reproduce the issue with a plain index delete operation.

High-level steps to reproduce the behavior:

  1. Create two indices foo-1 and foo-2.
  2. Create an alias foo containing foo-1 and foo-2.
  3. Concurrently:
  a. Perform a delete_by_query on foo* (we use slices=auto and conflicts=proceed).
    b. Update the foo alias with {"actions": [{"remove_index": {"index": "foo-2"}}]}.
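The steps above can be sketched as request payloads. This is an illustration only: the base URL is a placeholder, and the script just builds and prints the JSON bodies (actually issuing them requires a running cluster and a concurrency harness around steps 3a and 3b):

```python
import json

# Hypothetical cluster endpoint; adjust for your environment.
BASE_URL = "http://localhost:9200"

# Steps 1-2: two indices behind one alias.
# POST {BASE_URL}/_aliases
alias_setup = {
    "actions": [
        {"add": {"index": "foo-1", "alias": "foo"}},
        {"add": {"index": "foo-2", "alias": "foo"}},
    ]
}

# Step 3a: POST {BASE_URL}/foo*/_delete_by_query?slices=auto&conflicts=proceed
delete_by_query_body = {"query": {"match_all": {}}}

# Step 3b (run concurrently with 3a): POST {BASE_URL}/_aliases
remove_index_action = {"actions": [{"remove_index": {"index": "foo-2"}}]}

for name, payload in [
    ("alias setup", alias_setup),
    ("delete_by_query body", delete_by_query_body),
    ("remove_index action", remove_index_action),
]:
    print(name, json.dumps(payload))
```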

Expected behavior

I would expect at least a non-500 status code. Perhaps a 404 in the case where ignore_unavailable is false and a 200 in the case where ignore_unavailable is true.


Host/Environment (please complete the following information):

  • OS: Linux
  • Version 2.5

Additional context

Setting ignore_unavailable seems to have no impact on the behavior.

@blampe blampe added bug Something isn't working untriaged labels Jul 3, 2023
@dblock
Member

dblock commented Jul 6, 2023

Looks like a legit bug. Do you think you could reproduce this in a YAML REST test?

@blampe
Author

blampe commented Jul 12, 2023

@dblock from briefly looking at the YAML test framework it seems the requests all run sequentially? To repro this the index removal would need to run concurrent with the delete_by_query. I think that would look roughly like the below -- let me know if there's an easy way to run the two do steps at the same time.

(Note that even if the steps are run concurrently, I wouldn't be surprised if there's not enough data to make the delete_by_query take long enough to race with the index removal.)

---
 REPRO:
   - do:
       index:
         index: test-1
         id: 1
         body: { "text": "test-1" }

   - do:
       index:
         index: test-2
         id: 2
         body: { "text": "test-2" }

   - do:
       indices.update_aliases:
         body:
           actions:
             - add:
                 index: test-1
                 aliases: [test]
             - add:
                 index: test-2
                 aliases: [test]

   - do:
       indices.refresh: {}

   - do:
       delete_by_query:
         index: test*
         slices: auto
         conflicts: proceed
         ignore_unavailable: true
         body:
           query:
             match_all: {}

   - do: # Needs to be concurrent with the delete_by_query
       indices.update_aliases:
         body:
           actions:
             - remove_index:
                 index: test-2

   - match: { failures: [] }

@dblock
Member

dblock commented Jul 17, 2023

I don't love the idea that the test may sometimes pass and sometimes fail. Maybe these YAML tests simply can't express this scenario and we're trying too hard?

@blampe
Author

blampe commented Jul 17, 2023

> I don't love the idea that the test may sometimes pass and sometimes fail. Maybe these YAML tests simply can't express this scenario and we're trying too hard?

I agree, reproducing the race condition is probably more work than it's worth.

There's probably a unit of logic here that needs to handle the index-no-longer-exists case more gracefully. Maybe this is isRollupIndex? It's hard for me to read the trace.
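For illustration only, here is a Python sketch (not the actual Kotlin in RollupUtils.kt) of the kind of null guard the failing line suggests isRollupIndex needs. The settings key name below is a placeholder, not the real setting:

```python
# Placeholder for whatever rollup marker setting RollupUtils.kt actually checks.
ROLLUP_SETTING_KEY = "index.rollup_index"  # hypothetical key name

def is_rollup_index(index: str, cluster_metadata: dict) -> bool:
    """Return True if `index` looks like a rollup index, tolerating deletion.

    In the stack trace, Metadata.index(String) returned null for a
    concurrently deleted index, and calling getSettings() on it raised
    the NullPointerException.
    """
    index_metadata = cluster_metadata.get(index)
    if index_metadata is None:
        # A just-deleted index is simply not a rollup index.
        return False
    return index_metadata.get("settings", {}).get(ROLLUP_SETTING_KEY) == "true"
```

With the guard in place, asking about a deleted index returns False instead of raising, which is roughly the shape a fix in the Kotlin code would take.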

@dblock
Member

dblock commented Jul 17, 2023

@blampe I really don't know this code. If you want to keep digging, I would focus on a test that reproduces this problem 100% of the time; it can be pretty low level. If it's very hard to test, then maybe that's telling us something. Anyway, I am spitballing.

@andrross
Member

It should be possible to create a deterministic reproduction, but it will likely require a test that injects latches or something else that isn't possible at the YAML test level. It should be possible as an integration test, though (sorry I can't be more specific without diving a lot deeper into this issue).
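The latch idea can be sketched generically. This uses Python threading for illustration only (a real OpenSearch reproduction would use the Java integration-test framework), with stub functions standing in for the cluster operations:

```python
import threading

# Stub cluster state; names are illustrative.
indices = {"foo-1": True, "foo-2": True}
delete_started = threading.Event()  # latch: the query has resolved its indices
index_removed = threading.Event()   # latch: the concurrent removal has landed

def delete_by_query_stub(results):
    # Resolve the wildcard, then block on the latch so the removal is
    # guaranteed to land mid-query, making the race deterministic.
    resolved = [name for name, alive in indices.items() if alive]
    delete_started.set()
    index_removed.wait()
    # The buggy code path would now dereference metadata for a deleted
    # index; the stub just records what it observes instead.
    for name in resolved:
        results[name] = "deleted-docs" if indices[name] else "missing"

def remove_index_stub():
    delete_started.wait()
    indices["foo-2"] = False
    index_removed.set()

results = {}
t1 = threading.Thread(target=delete_by_query_stub, args=(results,))
t2 = threading.Thread(target=remove_index_stub)
t1.start(); t2.start()
t1.join(); t2.join()
print(results)
```

The latches force the interleaving every run, which is the property a deterministic regression test needs.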

@blampe
Author

blampe commented Jul 24, 2023

opensearch-project/index-management#855 might address the underlying issue.

I imagine a library dependency needs to get bumped to pull in that fix, but I'm not sure how to do that. (If this is using index-management 2.x, the fix is being backported in opensearch-project/index-management#871.)
