Elasticsearch cluster frozen by 1 pending task #80132

Closed
Tomasz-Kluczkowski opened this issue Nov 1, 2021 · 2 comments
Labels
>bug, needs:triage (Requires assignment of a team area label)

Comments

@Tomasz-Kluczkowski

Elasticsearch version (bin/elasticsearch --version):
7.10.2 (the same happened on 7.8.0, from which we upgraded to 7.10.2 in the hope that the problem would be fixed)

Plugins installed: [repository-azure]

JVM version (java -version):

openjdk version "15.0.1" 2020-10-20
OpenJDK Runtime Environment AdoptOpenJDK (build 15.0.1+9)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 15.0.1+9, mixed mode, sharing)

OS version (uname -a if on a Unix-like system):

Linux core-opendistro-es-master-0 5.4.0-1055-azure #57~18.04.1-Ubuntu SMP Fri Jul 16 19:40:19 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

We are running opendistro-for-elasticsearch using the Helm chart from https://github.com/opendistro-for-elasticsearch/opendistro-build with one modification: the Elasticsearch image has the repository-azure plugin installed on top of its standard content.
Everything else is configured via Helm release values and is pretty standard.

For the last 2-3 weeks, every 1 to 3 days, the cluster just stops processing tasks completely.

This is due to one stuck task, which can be observed via the GET _cluster/pending_tasks query.
This has already happened to us several times, and each time a different task was stuck, so I cannot point to one particular cause. The current workaround is to kill the pods and let the master nodes restart.

Examples of the GET _cluster/pending_tasks output when this bug happened to us. There is always exactly one pending task with "executing": true; the rest (and there are eventually a lot of other pending tasks) remain at "executing": false.

On version 7.8.0 (so before we upgraded in the hope it would be fixed):

{
  "insert_order" : 38880,
  "priority" : "NORMAL",
  "source" : "opendistro-ism",
  "executing" : true,
  "time_in_queue_millis" : 75175702,
  "time_in_queue" : "20.8h"
}

{
  "insert_order" : 10322,
  "priority" : "URGENT",
  "source" : "shard-started StartedShardEntry{shardId [[metrics-postgresql-2021.10.23][0]], allocationId [fFRMI7EqRP-rP2-osyWfrQ], primary term [1], message [after new shard recovery]}",
  "executing" : true,
  "time_in_queue_millis" : 157249293,
  "time_in_queue" : "1.8d"
}

On 7.10.2 I unfortunately lost the record, but it was a carbon copy of the top one for 7.8.0 (just with opendistro-im as the source instead of opendistro-ism):

{
  "tasks" : [
    {
      "insert_order" : some insert order,
      "priority" : "NORMAL",
      "source" : "opendistro-im",
      "executing" : true,
      "time_in_queue_millis" : a really large number,
      "time_in_queue" : "some time in days!"
    },
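For anyone hitting the same symptom, here is a minimal sketch of how the pending-task queue can be polled to isolate the executing entry. It assumes curl and jq are available and that the cluster answers on localhost:9200 without auth; adjust the URL/credentials for your setup.

# Show only the task currently marked as executing, plus the total queue depth.
curl -s "localhost:9200/_cluster/pending_tasks" \
  | jq '{executing: [.tasks[] | select(.executing == true)], queued: (.tasks | length)}'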

I have checked, and when the cluster is in that state, GET _cat/shards?v=true reports everything as STARTED (I trimmed the output down, as all of the > 1000 records show STARTED):
I am assuming that this is a result of that task being stuck rather than the cause of it...

index                                                      shard prirep state       docs    store ip            node
azure-postgres-2021.07.14                                  1     r      STARTED    54971   20.6mb <IP> core-opendistro-es-data-0
azure-postgres-2021.07.14                                  1     p      STARTED    54971   20.6mb <IP> core-opendistro-es-data-1
azure-postgres-2021.07.14                                  0     p      STARTED    54994   20.6mb <IP> core-opendistro-es-data-2
azure-postgres-2021.07.14                                  0     r      STARTED    54994   20.6mb <IP> core-opendistro-es-data-0
logstash-python-logstash-opt-2021.09.01                    1     p      STARTED   175226  246.3mb <IP> core-opendistro-es-data-2
logstash-python-logstash-opt-2021.09.01                    1     r      STARTED   175226  246.3mb <IP> core-opendistro-es-data-1
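As a quick sanity check that nothing is initializing or relocating, the shard states can be summarised instead of eyeballing > 1000 rows; a small sketch under the same localhost/no-auth assumption as above:

# Count shards per state; a stuck-but-green cluster shows only STARTED here.
curl -s "localhost:9200/_cat/shards?h=state" | sort | uniq -c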

The result of GET _cluster/allocation/explain likewise shows no issues; there are no unassigned shards, etc.:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "unable to find any unassigned shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "unable to find any unassigned shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false]"
  },
  "status" : 400
}

The result of GET _cluster/health?filter_path=status,*_shards while we are stuck shows a healthy cluster :):

{
  "status" : "green",
  "active_primary_shards" : 594,
  "active_shards" : 1189,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0
}
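Since the standard health endpoint stays green while the queue is blocked, a more useful liveness check for this failure mode is the age of the oldest pending task. A rough sketch, with a hypothetical 10-minute threshold and the same localhost/no-auth assumption:

# Alert if any pending cluster-state task has been queued for more than 10 minutes.
THRESHOLD_MS=600000   # hypothetical threshold, tune as needed
OLDEST=$(curl -s "localhost:9200/_cluster/pending_tasks" \
  | jq '[.tasks[].time_in_queue_millis] | max // 0')
if [ "$OLDEST" -gt "$THRESHOLD_MS" ]; then
  echo "Cluster-state queue looks stuck: oldest task queued for ${OLDEST}ms"
fi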

Checking the result of GET _tasks while the cluster is hanging (this was after it had been stuck for 3 days) returns about 2.8 million lines of JSON. All tasks have the action indices:data/write/bulk (or a variant with [s] or similar) and some astronomical running time, which is expected since nothing can proceed forward...

Showing just one record for completeness' sake:

"DVlyTw4rQg25wfR1Acixkw:2621390" : {
          "node" : "DVlyTw4rQg25wfR1Acixkw",
          "id" : 2621390,
          "type" : "transport",
          "action" : "indices:data/write/bulk[s]",
          "start_time_in_millis" : 1635735655392,
          "running_time_in_nanos" : 25876552596589,
          "cancellable" : false,
          "parent_task_id" : "DVlyTw4rQg25wfR1Acixkw:2621389",
          "headers" : { }
        },
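Rather than scrolling through millions of lines, the _tasks output can be summarised per action to see what is piling up behind the stuck cluster-state update; a jq sketch under the same assumptions as above:

# Count running tasks per action, most frequent first.
curl -s "localhost:9200/_tasks" \
  | jq '[.nodes[].tasks[].action] | group_by(.) | map({action: .[0], count: length}) | sort_by(-.count)'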

As expected, the result of GET /_cat/thread_pool?v=true&h=id,name,active,rejected,completed shows a growing number of rejected writes, but to me this is a symptom rather than a cause; I am reporting it only for the sake of completeness (again pasting in only the relevant fragment of the response).

id                     name                                   active rejected completed
OPUWg9MRQjG-RbYZYqu6qA write                                       0    18849   1709160
tQBXjgZiQLehN8sptb11zw write                                       4    23137   1920852
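To watch whether the rejections keep growing while the cluster is wedged, the same _cat endpoint can be narrowed to the write pool; a small sketch (localhost/no-auth assumed):

# Rejections on the write pool only; re-run periodically and compare the numbers.
curl -s "localhost:9200/_cat/thread_pool/write?v=true&h=node_name,name,active,queue,rejected,completed"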

Unfortunately, at the time this happened no one could log in and run the GET _nodes/hot_threads query, but since this is a problem that will come back, I will add the response here when it happens again. My suspicion is that the thread responsible for processing the tasks is simply hanging on something.
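So that this data exists the next time it happens, a simple capture loop can be left running against one of the nodes; a sketch assuming the same unauthenticated localhost endpoint and a hypothetical /tmp/hot_threads output directory:

# Dump hot threads from all nodes every 5 minutes so the stuck state gets recorded.
mkdir -p /tmp/hot_threads
while true; do
  curl -s "localhost:9200/_nodes/hot_threads?threads=9999&ignore_idle_threads=false" \
    > "/tmp/hot_threads/$(date +%Y%m%dT%H%M%S).txt"
  sleep 300
done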

Steps to reproduce:

Please include a minimal but complete recreation of the problem, including (e.g.) index creation, mappings, settings, query, etc. The easier you make it for us to reproduce, the more likely it is that somebody will take the time to look at it.

This is not easily reproducible; it usually happens once every 2 days or so. I suspect a deadlock.
Hopefully, once this happens again for us, I can provide the hot_threads response, and that will help guide us to the real source of the problem.

If you have any suggestions on what other queries/stats could be checked, please let me know.

Provide logs (if relevant):

I have looked into the logs of all master nodes (we are running 3), and also of the data nodes just in case, around the time of the incident, and there is absolutely nothing logged that suggests what the problem could be (no warn or error logs or anything unusual).
Again, I expect this to be the case, since my suspicion is that a thread responsible for task processing is hanging, so nothing will get logged...

@Tomasz-Kluczkowski added the >bug and needs:triage (Requires assignment of a team area label) labels on Nov 1, 2021
@Tomasz-Kluczkowski
Author

Also worth noting: this seems similar to #35338, which I think was never resolved.

@dakrone
Member

dakrone commented Nov 3, 2021

Hi @Tomasz-Kluczkowski,

Looking at this, the task that is actually stuck appears to be opendistro-ism, which is a task run by that plugin. As such, there isn't really any diagnosing we can do for it, as it's a third-party plugin. I would recommend that you open this request on the opendistro-ism GitHub repository. Since this is not part of Elasticsearch itself, I am going to close this issue.

@dakrone dakrone closed this as completed Nov 3, 2021