
[BUG] NullPointerException in delete_by_query when index is deleted #8418

Open
blampe opened this issue Jul 3, 2023 · 7 comments

Comments

@blampe

blampe commented Jul 3, 2023

Describe the bug

delete_by_query with a wildcard pattern can return a 500 if an underlying index is deleted while the query is running.

I've seen two stack traces which seem to reflect the different phases of the delete_by_query that can encounter the race condition:

java.lang.NullPointerException: Cannot invoke "org.opensearch.cluster.metadata.IndexMetadata.getSettings()" because the return value of "org.opensearch.cluster.metadata.Metadata.index(String)" is null
    at org.opensearch.indexmanagement.rollup.util.RollupUtilsKt.isRollupIndex(RollupUtils.kt:70)
    at org.opensearch.indexmanagement.rollup.interceptor.RollupInterceptor$interceptHandler$1.messageReceived(RollupInterceptor.kt:81)
    at org.opensearch.performanceanalyzer.transport.PerformanceAnalyzerTransportRequestH...+3903 more"
Failed to execute phase [can_match], Partial shards failure; ... RemoteTransportException[[36a7dbd9822c64d770ce4c655e56ec40][__IP__][__PATH__[can_match]]]; nested: NullPointerException; }
    at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:674)
    at org.opensearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:400)
    at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:709)
    at org.opensearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:482)
    at org.opensearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:297)
    at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:74)
    ...

To Reproduce

This has been reproducible in our production setup which consists of a steady stream of delete_by_query operations and occasional alias updates (concurrent with the deletes). The fact that there's an alias involved may or may not be relevant -- I haven't attempted to reproduce the issue with a plain index delete operation.

High-level steps to reproduce the behavior:

  1. Create two indices foo-1 and foo-2.
  2. Create an alias foo containing foo-1 and foo-2.
  3. Concurrently:
  a. Perform a delete_by_query on foo* (we use slices=auto and conflicts=proceed).
    b. Update the foo alias with {"actions": [{"remove_index": {"index": "foo-2"}}]}.
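The steps above can be sketched as request payloads. This is an illustration only: the base URL is a placeholder, and the script just builds and prints the JSON bodies (actually issuing them requires a running cluster and a concurrency harness around steps 3a and 3b):

```python
import json

# Hypothetical cluster endpoint; adjust for your environment.
BASE_URL = "http://localhost:9200"

# Steps 1-2: two indices behind one alias.
# POST {BASE_URL}/_aliases
alias_setup = {
    "actions": [
        {"add": {"index": "foo-1", "alias": "foo"}},
        {"add": {"index": "foo-2", "alias": "foo"}},
    ]
}

# Step 3a: POST {BASE_URL}/foo*/_delete_by_query?slices=auto&conflicts=proceed
delete_by_query_body = {"query": {"match_all": {}}}

# Step 3b (run concurrently with 3a): POST {BASE_URL}/_aliases
remove_index_action = {"actions": [{"remove_index": {"index": "foo-2"}}]}

for name, payload in [
    ("alias setup", alias_setup),
    ("delete_by_query body", delete_by_query_body),
    ("remove_index action", remove_index_action),
]:
    print(name, json.dumps(payload))
```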

Expected behavior

I would expect at least a non-500 status code. Perhaps a 404 in the case where ignore_unavailable is false and a 200 in the case where ignore_unavailable is true.


Host/Environment (please complete the following information):

  • OS: Linux
  • Version 2.5

Additional context

Setting ignore_unavailable seems to have no impact on the behavior.

@blampe blampe added bug Something isn't working untriaged labels Jul 3, 2023
@dblock
Member

dblock commented Jul 6, 2023

Looks like a legit bug. Do you think you could reproduce this in a YAML REST test?

@blampe
Author

blampe commented Jul 12, 2023

@dblock from briefly looking at the YAML test framework it seems the requests all run sequentially? To repro this the index removal would need to run concurrent with the delete_by_query. I think that would look roughly like the below -- let me know if there's an easy way to run the two do steps at the same time.

(Note that even if the steps are run concurrently, I wouldn't be surprised if there's not enough data to make the delete_by_query take long enough to race with the index removal.)

---
 REPRO:
   - do:
       index:
         index: test-1
         id: 1
         body: { "text": "test-1" }

   - do:
       index:
         index: test-2
         id: 2
         body: { "text": "test-2" }

   - do:
       indices.update_aliases:
         body:
           actions:
             - add:
                 index: test-1
                 aliases: [test]
             - add:
                 index: test-2
                 aliases: [test]

   - do:
       indices.refresh: {}

   - do:
       delete_by_query:
         index: test*
         slices: auto
         conflicts: proceed
         ignore_unavailable: true
         body:
           query:
             match_all: {}

   - do: # Needs to be concurrent with the delete_by_query
       indices.update_aliases:
         body:
           actions:
             - remove_index:
                 index: test-2

   - match: { failures: [] }

@dblock
Member

dblock commented Jul 17, 2023

I don't love the idea that the test may sometimes pass and sometimes fail. Maybe these YAML tests simply can't express this scenario and we're trying too hard?

@blampe
Author

blampe commented Jul 17, 2023

> I don't love the idea that the test may sometimes pass and sometimes fail. Maybe these YAML tests simply can't express this scenario and we're trying too hard?

I agree, reproducing the race condition is probably more work than it's worth.

There's probably a unit of logic here that needs to handle the index-no-longer-exists case more gracefully. Maybe this is isRollupIndex? It's hard for me to read the trace.
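For illustration only, here is a Python sketch (not the actual Kotlin in RollupUtils.kt) of the kind of null guard the failing line suggests isRollupIndex needs. The settings key name below is a placeholder, not the real setting:

```python
# Placeholder for whatever rollup marker setting RollupUtils.kt actually checks.
ROLLUP_SETTING_KEY = "index.rollup_index"  # hypothetical key name

def is_rollup_index(index: str, cluster_metadata: dict) -> bool:
    """Return True if `index` looks like a rollup index, tolerating deletion.

    In the stack trace, Metadata.index(String) returned null for a
    concurrently deleted index, and calling getSettings() on it raised
    the NullPointerException.
    """
    index_metadata = cluster_metadata.get(index)
    if index_metadata is None:
        # A just-deleted index is simply not a rollup index.
        return False
    return index_metadata.get("settings", {}).get(ROLLUP_SETTING_KEY) == "true"
```

With the guard in place, asking about a deleted index returns False instead of raising, which is roughly the shape a fix in the Kotlin code would take.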

@dblock
Member

dblock commented Jul 17, 2023

@blampe I really don't know this code. If you want to keep digging, I would focus on a test that reproduces this problem 100% of the time; it can be pretty low level. If it's very hard to test, then maybe that's telling us something. Anyway, I am spitballing.

@andrross
Member

It should be possible to create a deterministic reproduction, but it will likely require a test that injects latches or something else that isn't possible at the YAML test level. It should be possible as an integration test, though (sorry I can't be more specific without diving a lot deeper into this issue).
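The latch idea can be sketched generically. This uses Python threading for illustration only (a real OpenSearch reproduction would use the Java integration-test framework), with stub functions standing in for the cluster operations:

```python
import threading

# Stub cluster state; names are illustrative.
indices = {"foo-1": True, "foo-2": True}
delete_started = threading.Event()  # latch: the query has resolved its indices
index_removed = threading.Event()   # latch: the concurrent removal has landed

def delete_by_query_stub(results):
    # Resolve the wildcard, then block on the latch so the removal is
    # guaranteed to land mid-query, making the race deterministic.
    resolved = [name for name, alive in indices.items() if alive]
    delete_started.set()
    index_removed.wait()
    # The buggy code path would now dereference metadata for a deleted
    # index; the stub just records what it observes instead.
    for name in resolved:
        results[name] = "deleted-docs" if indices[name] else "missing"

def remove_index_stub():
    delete_started.wait()
    indices["foo-2"] = False
    index_removed.set()

results = {}
t1 = threading.Thread(target=delete_by_query_stub, args=(results,))
t2 = threading.Thread(target=remove_index_stub)
t1.start(); t2.start()
t1.join(); t2.join()
print(results)
```

The latches force the interleaving every run, which is the property a deterministic regression test needs.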

@blampe
Author

blampe commented Jul 24, 2023

opensearch-project/index-management#855 might address the underlying issue.

I imagine a library dependency needs to get bumped to pull in that fix, but I'm not sure how to do that. (If this is using index-management 2.x, the fix is being backported in opensearch-project/index-management#871.)
