[BUG] sporadic concurrent_modification_exception during query in 2.14 #14032

Closed
janheise opened this issue Jun 6, 2024 · 4 comments · Fixed by #14221
Labels: bug (Something isn't working), Search (Search query, autocomplete ...etc), v2.15.0 (Issues and PRs related to version 2.15.0)

Comments


janheise commented Jun 6, 2024

Describe the bug

As the screenshot shows, a ConcurrentModificationException is thrown during the query.

Screenshot 2024-06-06 at 15 11 43

Graylog users who started using OpenSearch 2.14 reported this as a problem affecting their queries, so we started to investigate.

The output of an msearch that carries this exception looks like the following:

"failed":1,"failures":[{"shard":0,"index":"graylog_0",
"node":"jxRdA49HT4uuwWu7VVGyjw","reason":{"type":"concurrent_modification_exception","reason":null}}]}

So the response contains no stack trace, and nothing shows up in the logs either.
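Because the failure is only visible in the response body, a reproduction loop has to inspect that body. The following is a minimal sketch of such a check. The class name ShardFailureScan and the method failureTypes are hypothetical, and the regular expression is a crude scan rather than a real JSON parser: it assumes "type" is the first key inside each "reason" object, as in the fragment above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenSearch or Graylog.
class ShardFailureScan {
    // Matches "reason":{"type":"<value>" and captures <value>.
    // Assumption: "type" is the first key of the "reason" object.
    private static final Pattern REASON_TYPE =
        Pattern.compile("\"reason\"\\s*:\\s*\\{\\s*\"type\"\\s*:\\s*\"([^\"]+)\"");

    // Returns the failure types found in a response body, in order of appearance.
    static List<String> failureTypes(String responseBody) {
        List<String> types = new ArrayList<>();
        Matcher matcher = REASON_TYPE.matcher(responseBody);
        while (matcher.find()) {
            types.add(matcher.group(1));
        }
        return types;
    }
}
```

For anything beyond a quick check, a proper JSON library would be the more robust choice.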

While trying to reproduce the problem, I was lucky to have the debugger attached. It caught the exception, which is how I got the screenshot above.

The following line in the method private void updateStaleCountOnCacheInsert(CleanupKey cleanupKey) throws the exception:

cleanupKeyToCountMap.computeIfAbsent(shardId, k -> new HashMap<>()).merge(cleanupKey.readerCacheKeyId, 1, Integer::sum);

It was introduced with #12707, if I'm correct, which also means the error could (and probably should) already have occurred in 2.13?
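The failing line updates a nested java.util.HashMap, which is not safe for unsynchronized use from several threads. The sketch below shows one possible thread-safe variant of the same counting pattern, under the assumption that concurrent access is the cause; this thread does not confirm that, and the sketch is not a description of the change made in #14221. The class StaleCountSketch, its methods, and the String keys are hypothetical stand-ins for the real types.

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the counting logic; not the OpenSearch code.
class StaleCountSketch {
    // shard id -> (reader cache key id -> count); both levels are concurrent maps.
    private final Map<String, Map<String, Integer>> counts = new ConcurrentHashMap<>();

    void onCacheInsert(String shardId, String readerCacheKeyId) {
        counts.computeIfAbsent(shardId, k -> new ConcurrentHashMap<>())
              .merge(readerCacheKeyId, 1, Integer::sum);
    }

    int count(String shardId, String readerCacheKeyId) {
        return counts.getOrDefault(shardId, Collections.emptyMap())
                     .getOrDefault(readerCacheKeyId, 0);
    }

    // Runs the insert path from several threads at once and returns the summed counts.
    static int runConcurrently(int threads, int insertsPerThread) {
        StaleCountSketch sketch = new StaleCountSketch();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch start = new CountDownLatch(1);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                try {
                    start.await(); // release all workers together
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int i = 0; i < insertsPerThread; i++) {
                    sketch.onCacheInsert("shard-0", "reader-" + (i % 4));
                }
            });
        }
        start.countDown();
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        int total = 0;
        for (int r = 0; r < 4; r++) {
            total += sketch.count("shard-0", "reader-" + r);
        }
        return total;
    }
}
```

With both map levels backed by ConcurrentHashMap, computeIfAbsent and merge are atomic per key, so the summed count should equal threads times insertsPerThread.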

The error condition is a bit awkward to reproduce:

On a Graylog instance with a random message generator running, the attached script/query reproduced the error quite consistently, roughly every 2,500 to 3,000 queries, against OpenSearch 2.14 in Docker.

When running OpenSearch via ./gradlew run with the debugger attached, it takes about 40,000 to 50,000 queries until the error shows up.

msearch3-loop.sh.txt

msearch3.req.txt

The query stays identical but fails at some point. I think there needs to be some traffic on the index so that the query is evaluated every time and not cached.
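The attached script is not reproduced here. A rough Java equivalent of such a loop might look like the sketch below, which sends the same msearch body repeatedly and stops at the first response that mentions the exception. The class MsearchLoopSketch, its method names, and the base URL passed in by the caller are all assumptions; the request body would be whatever msearch3.req.txt contains.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical reproduction loop; not the attached shell script.
class MsearchLoopSketch {
    // Builds a POST to <baseUrl>/_msearch with a newline-delimited JSON body.
    static HttpRequest buildRequest(String baseUrl, String ndjsonBody) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_msearch"))
            .header("Content-Type", "application/x-ndjson")
            .POST(HttpRequest.BodyPublishers.ofString(ndjsonBody))
            .build();
    }

    // Returns the 1-based iteration at which the exception first appeared,
    // or -1 if it never appeared within maxQueries.
    static int runUntilFailure(String baseUrl, String ndjsonBody, int maxQueries)
            throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = buildRequest(baseUrl, ndjsonBody);
        for (int i = 1; i <= maxQueries; i++) {
            HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.body().contains("concurrent_modification_exception")) {
                return i;
            }
        }
        return -1;
    }
}
```

As noted above, the index probably needs ongoing write traffic while the loop runs; otherwise the identical query may be served from the cache and never reach the failing code path.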

Let me know if you need more information.

Related component

Search

To Reproduce

We're working on a setup.

Expected behavior

No ConcurrentModificationException should occur.



janheise commented Jun 7, 2024

@kiranprakash154 Hi, may I ask you to take a look? I think, because it's a backport, the error will also occur in 3.0.0?

@kiranprakash154
Contributor

Hey @janheise, thanks for reporting. Let me take a look.

@kiranprakash154
Contributor

@janheise What were the contents of your index "graylog_0"?
Can you provide them? That would make it easier for me to repro this.

@janheise
Author

@kiranprakash154 I attached two files: one that shows the index structure and one with some data. The data gets randomly generated. Is this what you need?

graylog0_def.json
graylog0_query.json

Projects
Status: Done
4 participants