OOM on date_histogram with small interval #72619
Pinging @elastic/es-analytics-geo (Team:Analytics)
Bash script to reproduce the issue:
Thanks @sag-tobias-frey. It's interesting that you got there with nested! I hadn't realized that might be in the mix. Fun times.
I have managed to get there even without nested:
I executed some tests (with a 0.5GB heap) running the query against the current master branch, and I see that we still have an OOM, but not the one described originally. If my understanding is correct, the original OOM happened on the coordinator and has been fixed by #72081; that patch makes sure we hit the circuit breaker on the coordinator before the OOM takes place. Right now, anyway, I see something different, and my understanding is that the OOM is happening on the data node. What I see is that the method
According to the heap dump these objects (
Increasing the heap to 2GB I get the following response (no OOM)
NOTE: the original query is actually calculating a cross product on the
However, shouldn't the number of resulting buckets from the cross product of
Have you tried sending multiple of these requests in parallel against the 2GB cluster? We noticed that we have to be careful with this kind of aggregation/distribution when we have parallel requests, because then the OOM error might still occur even with more heap: the circuit breaker does not detect it early enough.
Regarding the cross product, my understanding is different. If we have three documents, each with
Extending it to 1000 distinct values results in 1M buckets. Anyway, yes, the problem is that the circuit breaker is not firing, but I think this is not happening because the creation of objects like
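The bucket arithmetic in that argument can be sketched as follows. This is an illustration, not Elasticsearch code: the cardinalities are hypothetical, and it only shows why two fields with 1000 distinct values each yield 1M buckets when one terms aggregation is nested inside another.

```python
from itertools import product

def bucket_count(cardinality_a, cardinality_b):
    # A terms aggregation over field B nested inside one over field A
    # produces one bucket per (A-value, B-value) pair, i.e. a cross
    # product; spelled out with product() to mirror the enumeration.
    return sum(1 for _ in product(range(cardinality_a), range(cardinality_b)))

print(bucket_count(3, 3))        # 3 distinct values per field -> 9 buckets
print(bucket_count(1000, 1000))  # 1000 distinct values per field -> 1,000,000 buckets
```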
Removed
I had a discussion with the team about this issue, and the agreement is that it needs to be addressed by the following two issues:
The list of objects taking the most space is the following:
Attaching a script which triggers the issue (test.txt).
Just an update for posterity/those following along at home: this is mostly #77449. Our response objects are super wasteful and sometimes allocate so quickly that the real memory breaker doesn't catch them. A dense representation would save us here, and help lots of other things. In the short run, I expect we could save some heap by reworking how
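A rough illustration, in Python rather than Elasticsearch's actual response classes, of why a dense representation is cheaper: one small object per bucket carries per-object overhead that parallel primitive arrays avoid. The bucket count and field layout here are hypothetical.

```python
import sys
import array

n = 100_000  # hypothetical bucket count

# Object-per-bucket style: one dict per bucket, as a stand-in for
# per-bucket response objects (each carries its own object header).
sparse = [{"key": i, "doc_count": 0} for i in range(n)]
sparse_bytes = sys.getsizeof(sparse) + sum(sys.getsizeof(b) for b in sparse)

# Dense style: two parallel arrays of 8-byte integers.
keys = array.array("q", range(n))
doc_counts = array.array("q", [0] * n)
dense_bytes = sys.getsizeof(keys) + sys.getsizeof(doc_counts)

# The per-object layout costs several times more heap.
print(sparse_bytes // dense_bytes)
```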
Pinging @elastic/es-analytical-engine (Team:Analytics)
I recently merged #72081 which protects against OOM in the reduce phase for date_histograms. A discuss user reported having a similar issue, but they seem to have it on the data nodes while building results.
Elasticsearch version (bin/elasticsearch --version): Reported on 7.9. @nik9000 thinks it should be possible to reproduce against master.
Steps to reproduce:
We just got the stack trace in the linked discuss issue. Looks like they have a wide range and a tight interval. They have min_doc_count set to 0, but I don't think that matters too much here. I'd try one-second bins with a hundred thousand docs, all in a different second. The trick, I think, is not to run out of memory when collecting the agg (we have protections there) but to run out of memory when building bloaty result objects we send back to the coordinating node.
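A request body of the shape described above might look like the following. This is a sketch built from the details in this report (one-second bins, min_doc_count 0); the index and field names are hypothetical, not the reporter's actual query.

```python
import json

# Hypothetical search body: tight one-second buckets over a wide time
# range, with min_doc_count: 0 so empty buckets are materialized too.
search_body = {
    "size": 0,
    "aggs": {
        "per_second": {
            "date_histogram": {
                "field": "@timestamp",   # hypothetical field name
                "fixed_interval": "1s",  # tight interval from the report
                "min_doc_count": 0,      # forces empty buckets to be built
            }
        }
    },
}

print(json.dumps(search_body, indent=2))
```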