Elasticsearch version (bin/elasticsearch --version):
7.10.2 (the same happened on 7.8.0, from which we upgraded to 7.10.2 in the hope that the problem would be fixed)
Plugins installed: [repository-azure]
JVM version (java -version):
openjdk version "15.0.1" 2020-10-20
OpenJDK Runtime Environment AdoptOpenJDK (build 15.0.1+9)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 15.0.1+9, mixed mode, sharing)
OS version (uname -a if on a Unix-like system):
Linux core-opendistro-es-master-0 5.4.0-1055-azure #57~18.04.1-Ubuntu SMP Fri Jul 16 19:40:19 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
We are running opendistro-for-elasticsearch using the helm chart from here: https://github.com/opendistro-for-elasticsearch/opendistro-build with one modification (the elasticsearch image has the repository-azure plugin installed on top of its standard content).
Everything else is configured via Helm release values and is fairly standard.
For the last 2-3 weeks, every 1 to 3 days the cluster completely stops processing tasks.
This is due to a single stuck task, which can be observed via a GET _cluster/pending_tasks query.
This has happened to us several times already, and each time a different task was stuck, so I cannot point to one particular cause. The current workaround is to kill the pods and let the master nodes restart.
Examples of GET _cluster/pending_tasks output from when this bug hit us. There is always exactly one pending task with "executing": true; the rest (and eventually there are a lot of other pending tasks) remain at "executing": false.
On 7.8.0 (before we upgraded in the hope it would be fixed):
On 7.10.2 I unfortunately lost the record, but it was a carbon copy of the 7.8.0 output above (just with opendistro-im as the source instead of opendistro-ism):
{
"tasks" : [
{
"insert_order" : some insert order,
"priority" : "NORMAL",
"source" : "opendistro-im",
"executing" : true,
"time_in_queue_millis" : a really large number,
"time_in_queue" : "some time in days!"
},
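For readers diagnosing a similar situation, here is a minimal sketch that flags such a task programmatically. It assumes only the JSON shape of the GET _cluster/pending_tasks response shown above; the one-hour threshold and the sample values in the example are my own, made up for illustration:

```python
# Sketch: flag a cluster-state task that has been "executing" for too long.
# Assumes the response shape of GET _cluster/pending_tasks shown above;
# the one-hour threshold is an arbitrary choice for illustration.

STUCK_THRESHOLD_MS = 60 * 60 * 1000  # one hour

def find_stuck_tasks(pending_tasks_response, threshold_ms=STUCK_THRESHOLD_MS):
    """Return tasks marked executing whose queue time exceeds the threshold."""
    return [
        t for t in pending_tasks_response.get("tasks", [])
        if t.get("executing") and t.get("time_in_queue_millis", 0) > threshold_ms
    ]

# Example input mimicking the output above (all values are made up):
response = {
    "tasks": [
        {"insert_order": 101, "priority": "NORMAL", "source": "opendistro-ism",
         "executing": True, "time_in_queue_millis": 259_200_000,  # ~3 days
         "time_in_queue": "3d"},
        {"insert_order": 102, "priority": "NORMAL", "source": "put-mapping",
         "executing": False, "time_in_queue_millis": 5000,
         "time_in_queue": "5s"},
    ]
}

for task in find_stuck_tasks(response):
    print(f"stuck: {task['source']} ({task['time_in_queue']})")
```

A healthy cluster drains its pending queue quickly, so any task that is both executing and queued for hours is a strong signal of the hang described here.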
I have checked, and while the cluster is in that state, GET _cat/shards?v=true reports everything as started (I trimmed the output down, as all the records show STARTED and there are > 1000 of them):
I assume this is a result of the stuck task rather than its cause...
index shard prirep state docs store ip node
azure-postgres-2021.07.14 1 r STARTED 54971 20.6mb <IP> core-opendistro-es-data-0
azure-postgres-2021.07.14 1 p STARTED 54971 20.6mb <IP> core-opendistro-es-data-1
azure-postgres-2021.07.14 0 p STARTED 54994 20.6mb <IP> core-opendistro-es-data-2
azure-postgres-2021.07.14 0 r STARTED 54994 20.6mb <IP> core-opendistro-es-data-0
logstash-python-logstash-opt-2021.09.01 1 p STARTED 175226 246.3mb <IP> core-opendistro-es-data-2
logstash-python-logstash-opt-2021.09.01 1 r STARTED 175226 246.3mb <IP> core-opendistro-es-data-1
But the result of GET _cluster/allocation/explain shows no issues; there are no unassigned shards, etc.:
{
"error" : {
"root_cause" : [
{
"type" : "illegal_argument_exception",
"reason" : "unable to find any unassigned shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false]"
}
],
"type" : "illegal_argument_exception",
"reason" : "unable to find any unassigned shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false]"
},
"status" : 400
}
The result of GET _cluster/health?filter_path=status,*_shards while we are stuck shows a healthy cluster :):
Checking GET _tasks while the cluster is hanging (this was after it had been stuck for 3 days) returns about 2.8 million lines of JSON. All tasks have the action indices:data/write/bulk (or a variant with [s] or similar) and astronomical running times, which is expected since nothing can proceed forward...
As expected, GET /_cat/thread_pool?v=true&h=id,name,active,rejected,completed shows a growing number of rejected writes, but to me this is a consequence rather than a cause; I am reporting it only for completeness (again pasting only the relevant fragment of the response).
id name active rejected completed
OPUWg9MRQjG-RbYZYqu6qA write 0 18849 1709160
tQBXjgZiQLehN8sptb11zw write 4 23137 1920852
Unfortunately, at the time this happened no one could log in and run the GET _nodes/hot_threads query, but since this is a problem that will come back, I will add the response here when it happens again. My suspicion is that the thread responsible for processing the tasks is simply hanging on something.
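One way to avoid missing the evidence next time is a small watchdog that saves GET _nodes/hot_threads to a timestamped file whenever the pending queue looks stalled, so the capture happens even if no one can log in. A minimal sketch using only the standard library, assuming an unauthenticated cluster endpoint on localhost:9200 (adjust ES for a real deployment):

```python
import json
import time
import urllib.request

ES = "http://localhost:9200"  # assumption: unauthenticated local endpoint

def fetch(path):
    """GET a relative path from the cluster and return the raw response body."""
    with urllib.request.urlopen(ES + path, timeout=30) as resp:
        return resp.read().decode("utf-8")

def queue_looks_stalled(pending_json, threshold_ms=60 * 60 * 1000):
    """True if any executing pending task has been queued past the threshold."""
    tasks = json.loads(pending_json).get("tasks", [])
    return any(
        t.get("executing") and t.get("time_in_queue_millis", 0) > threshold_ms
        for t in tasks
    )

def snapshot_filename(prefix="hot_threads", now=None):
    """Timestamped file name, e.g. hot_threads-20210901-120000.txt (UTC)."""
    now = time.time() if now is None else now
    return f"{prefix}-{time.strftime('%Y%m%d-%H%M%S', time.gmtime(now))}.txt"

def watchdog(poll_seconds=300):
    """Poll the pending queue; dump hot_threads to a file when it stalls."""
    while True:
        if queue_looks_stalled(fetch("/_cluster/pending_tasks")):
            with open(snapshot_filename(), "w") as f:
                f.write(fetch("/_nodes/hot_threads"))
        time.sleep(poll_seconds)

# Run on a host with cluster access, e.g.:
# watchdog()
```

The stall check reuses the "executing for hours" signal from the pending-tasks output above, so the watchdog only fires in exactly the hung state described in this report.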
Steps to reproduce:
Please include a minimal but complete recreation of the problem,
including (e.g.) index creation, mappings, settings, query etc. The easier
you make it for us to reproduce it, the more likely that somebody will take the
time to look at it.
This is not easily reproducible; it usually happens about once every 2 days. I suspect a deadlock.
Hopefully, once this happens again, I can provide the hot_threads response, which should help guide us to the real source of the problem.
If you have any suggestions on what other queries/stats to check, please let me know.
Provide logs (if relevant):
I have looked at the logs of all the master nodes (we are running 3), and also the data nodes just in case, around the time of the incident, and there is absolutely nothing logged that suggests what the problem could be (no warn or error logs, nothing weird).
Again, I expected this, since my suspicion is that a thread responsible for task processing is hanging, so nothing would get logged...
Looking at this, the task that is actually stuck appears to be opendistro-ism, which is a task run by that plugin. As such, there isn't really any diagnosing we can do for it, since it's a third-party plugin. I would recommend opening this request on the opendistro-ism GitHub repository. Since this is not part of Elasticsearch itself, I am going to close this issue.