Elasticsearch cluster blocked by frozen task #35338

Closed

KeyZer opened this issue Nov 7, 2018 · 8 comments
Labels
:Distributed/Cluster Coordination (Cluster formation and cluster state publication, including cluster membership and fault detection.), feedback_needed

Comments

KeyZer commented Nov 7, 2018

Elasticsearch version (bin/elasticsearch --version):
Version: 6.3.0, Build: default/deb/424e937/2018-06-11T23:38:03.357887Z, JVM: 1.8.0_162

Plugins installed: [analysis-icu]

JVM version (java -version):
java version "1.8.0_162"
Java(TM) SE Runtime Environment (build 1.8.0_162-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.162-b12, mixed mode)

OS version (uname -a if on a Unix-like system):
Linux search01 4.4.0-1070-aws #80-Ubuntu SMP Thu Oct 4 13:56:07 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

An urgent cluster task blocked all other tasks that make changes to the cluster state (node joins/leaves, index setting changes, ...). This effectively forced the cluster into a read-only state and made it impossible to scale up or down, since the node-join tasks also got blocked behind the frozen high-priority task. There should at least be a timeout if a task takes many days to complete (see below for the output of GET _cluster/pending_tasks).

We worked around this by migrating to a new cluster, but that might not always be possible. After migrating to the new cluster, we restarted the master node of the failing cluster, and after a while the failing cluster started working normally again.
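For reference, the check and the workaround described above boil down to roughly the following shell commands (a sketch only; the host, port and the systemd service name are the defaults for a deb install and may differ):

# Inspect the queued cluster state update tasks on the elected master
curl -s 'http://localhost:9200/_cluster/pending_tasks?pretty'

# What eventually unblocked the old cluster: restarting Elasticsearch on the
# elected master host, so the stuck update thread dies with the old process
sudo systemctl restart elasticsearch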

Steps to reproduce:

We have not been able to reproduce the issue; it happened after 6 months in production during which we had increased the load on the cluster. The autoscaling of our Elasticsearch cluster might have triggered it, but that had been running for many months without issue. The only changes in the last few weeks were increasing the read load and adding nodes to the cluster more rapidly (two at the same time).

Provide logs (if relevant):

GET _cluster/pending_tasks
{
  "tasks": [
    {
      "insert_order": 4903913,
      "priority": "URGENT",
      "source": "shard-started StartedShardEntry{shardId [[hsearch_place-1541119291-1541107201707][0]], allocationId [HDkSthfcRyq6-1zdO0N1NA], message [after peer recovery]}",
      "executing": true,
      "time_in_queue_millis": 351128747,
      "time_in_queue": "4d"
    },
    {
      "insert_order": 4904859,
      "priority": "IMMEDIATE",
      "source": "zen-disco-node-left({search10}{UpapDMtiQQGsHt18XU_Obg}{hY9KJfUzRYCztq4N3fCpxw}{10.77.8.13}{10.77.8.13:9300}{ml.machine_memory=32899215360, rack=eu-west-1a, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}), reason(left)",
      "executing": false,
      "time_in_queue_millis": 308604643,
      "time_in_queue": "3.5d"
    },
    ....
}

Logfile from the master node

[2018-11-02T06:07:00,964][INFO ][o.e.c.m.MetaDataUpdateSettingsService] [search01] updating number_of_replicas to [2] for indices [hsearch_place-1541119291-1541107201707]
[2018-11-02T06:07:00,989][INFO ][o.e.c.m.MetaDataUpdateSettingsService] [search01] updating number_of_replicas to [2] for indices [hsearch_product-1541121966-1541107201707]
[2018-11-02T06:07:01,019][INFO ][o.e.c.m.MetaDataUpdateSettingsService] [search01] updating number_of_replicas to [2] for indices [hsearch_autocomplete-1541120879-1541107201707]
[2018-11-02T06:09:29,139][INFO ][o.e.c.s.MasterService    ] [search01] zen-disco-node-join, reason: added {{search06}{O3BIQMIFR2-MIOrR_8Rswg}{5SmazoiiQVuH8LiuZbm5dQ}{10.77.8.12}{10.77.8.12:9300}{ml.machine_memory=32899215360, rack=eu-west-1a, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}
[2018-11-02T06:09:32,206][INFO ][o.e.c.s.ClusterApplierService] [search01] added {{search06}{O3BIQMIFR2-MIOrR_8Rswg}{5SmazoiiQVuH8LiuZbm5dQ}{10.77.8.12}{10.77.8.12:9300}{ml.machine_memory=32899215360, rack=eu-west-1a, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}, reason: apply cluster state (from master [master {search01}{ACzv-eLuR8OGKGaVfzrZBQ}{w7ppkDQAQKe2HzQeRJ1Hjw}{10.77.8.10}{10.77.8.10:9300}{ml.machine_memory=8002396160, rack=eu-west-1a, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true} committed version [666678] source [zen-disco-node-join]])
[2018-11-02T06:10:35,393][INFO ][o.e.c.m.MetaDataMappingService] [search01] [hsearch_company-1541117051-1541107201707/xqp3YzGMRuqTWNYmOLwLbw] update_mapping [master]
[2018-11-02T06:10:49,685][INFO ][o.e.c.s.MasterService    ] [search01] zen-disco-node-join, reason: added {{search05}{bD5y0cMSTrej0HTTsm2Klg}{Do-92dxsQhygLBFyYyXU-w}{10.77.8.11}{10.77.8.11:9300}{ml.machine_memory=32899215360, rack=eu-west-1a, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}
[2018-11-02T06:10:52,709][INFO ][o.e.c.s.ClusterApplierService] [search01] added {{search05}{bD5y0cMSTrej0HTTsm2Klg}{Do-92dxsQhygLBFyYyXU-w}{10.77.8.11}{10.77.8.11:9300}{ml.machine_memory=32899215360, rack=eu-west-1a, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}, reason: apply cluster state (from master [master {search01}{ACzv-eLuR8OGKGaVfzrZBQ}{w7ppkDQAQKe2HzQeRJ1Hjw}{10.77.8.10}{10.77.8.10:9300}{ml.machine_memory=8002396160, rack=eu-west-1a, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true} committed version [666680] source [zen-disco-node-join]])
[2018-11-02T06:11:02,188][INFO ][o.e.c.m.MetaDataMappingService] [search01] [hsearch_person-1541122087-1541107201707/s7AxcsI8T1mxdEd1J6-jbw] update_mapping [master]
[2018-11-02T06:14:00,983][INFO ][o.e.c.m.MetaDataUpdateSettingsService] [search01] updating number_of_replicas to [3] for indices [hsearch_person-1541122087-1541107201707]
[2018-11-02T06:18:26,704][INFO ][o.e.c.r.a.AllocationService] [search01] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[hsearch_person-1541122087-1541107201707][4]] ...]).
[2018-11-02T06:23:01,443][INFO ][o.e.c.m.MetaDataUpdateSettingsService] [search01] updating number_of_replicas to [3] for indices [hsearch_person-1541122087-1541107201707]
[2018-11-02T06:23:01,503][INFO ][o.e.c.m.MetaDataUpdateSettingsService] [search01] updating number_of_replicas to [3] for indices [hsearch_company-1541117051-1541107201707]
[2018-11-02T06:23:01,652][WARN ][o.e.g.GatewayAllocator$InternalReplicaShardAllocator] [search01] [hsearch_company-1541117051-1541107201707][1]: failed to list shard for shard_store on node [O3BIQMIFR2-MIOrR_8Rswg]
org.elasticsearch.action.FailedNodeException: Failed node [O3BIQMIFR2-MIOrR_8Rswg]
    at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:237) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$200(TransportNodesAction.java:153) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleException(TransportNodesAction.java:211) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1095) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.transport.TcpTransport.lambda$handleException$34(TcpTransport.java:1510) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.common.util.concurrent.EsExecutors$1.execute(EsExecutors.java:135) [elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.transport.TcpTransport.handleException(TcpTransport.java:1508) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.transport.TcpTransport.handlerResponseError(TcpTransport.java:1500) [elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1430) [elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:64) [transport-netty4-6.3.0.jar:6.3.0]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310) [netty-codec-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297) [netty-codec-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413) [netty-codec-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265) [netty-codec-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) [netty-handler-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:545) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:499) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) [netty-common-4.1.16.Final.jar:4.1.16.Final]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_162]
Caused by: org.elasticsearch.transport.RemoteTransportException: [search06][10.77.8.12:9300][internal:cluster/nodes/indices/shard/store[n]]
Caused by: org.elasticsearch.ElasticsearchException: Failed to list store metadata for shard [[hsearch_company-1541117051-1541107201707][1]]
    at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:111) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:61) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:140) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:260) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:256) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:246) ~[?:?]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:304) ~[?:?]
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1592) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:724) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.3.0.jar:6.3.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_162]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_162]
    ... 1 more
Caused by: java.io.FileNotFoundException: no segments* file found in store(MMapDirectory@/data/elasticsearch/nodes/0/indices/xqp3YzGMRuqTWNYmOLwLbw/1/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@7961bae4): files: [recovery.tMp0C0AQSb-jFrqFryYFnA._16.cfe, recovery.tMp0C0AQSb-jFrqFryYFnA._16.cfs, recovery.tMp0C0AQSb-jFrqFryYFnA._16.si, recovery.tMp0C0AQSb-jFrqFryYFnA._16_5.liv, recovery.tMp0C0AQSb-jFrqFryYFnA._1j.dii, recovery.tMp0C0AQSb-jFrqFryYFnA._1j.dim, recovery.tMp0C0AQSb-jFrqFryYFnA._1j.fdx, recovery.tMp0C0AQSb-jFrqFryYFnA._1j.fnm, recovery.tMp0C0AQSb-jFrqFryYFnA._1j.nvd, recovery.tMp0C0AQSb-jFrqFryYFnA._1j.nvm, recovery.tMp0C0AQSb-jFrqFryYFnA._1j.si, recovery.tMp0C0AQSb-jFrqFryYFnA._1j_9.liv, recovery.tMp0C0AQSb-jFrqFryYFnA._1j_Lucene50_0.doc, recovery.tMp0C0AQSb-jFrqFryYFnA._1j_Lucene50_0.pay, recovery.tMp0C0AQSb-jFrqFryYFnA._1j_Lucene50_0.pos, recovery.tMp0C0AQSb-jFrqFryYFnA._1j_Lucene50_0.tim, recovery.tMp0C0AQSb-jFrqFryYFnA._1j_Lucene50_0.tip, recovery.tMp0C0AQSb-jFrqFryYFnA._1j_Lucene70_0.dvd, recovery.tMp0C0AQSb-jFrqFryYFnA._1j_Lucene70_0.dvm, recovery.tMp0C0AQSb-jFrqFryYFnA._1j_completion_0.cmp, recovery.tMp0C0AQSb-jFrqFryYFnA._1j_completion_0.doc, recovery.tMp0C0AQSb-jFrqFryYFnA._1j_completion_0.lkp, recovery.tMp0C0AQSb-jFrqFryYFnA._1j_completion_0.pay, recovery.tMp0C0AQSb-jFrqFryYFnA._1j_completion_0.pos, recovery.tMp0C0AQSb-jFrqFryYFnA._1j_completion_0.tim, recovery.tMp0C0AQSb-jFrqFryYFnA._1j_completion_0.tip, recovery.tMp0C0AQSb-jFrqFryYFnA._21.dii, recovery.tMp0C0AQSb-jFrqFryYFnA._21.dim, recovery.tMp0C0AQSb-jFrqFryYFnA._21.fdt, recovery.tMp0C0AQSb-jFrqFryYFnA._21.fdx, recovery.tMp0C0AQSb-jFrqFryYFnA._21.fnm, recovery.tMp0C0AQSb-jFrqFryYFnA._21.nvd, recovery.tMp0C0AQSb-jFrqFryYFnA._21.nvm, recovery.tMp0C0AQSb-jFrqFryYFnA._21.si, recovery.tMp0C0AQSb-jFrqFryYFnA._21_8.liv, recovery.tMp0C0AQSb-jFrqFryYFnA._21_Lucene50_0.doc, recovery.tMp0C0AQSb-jFrqFryYFnA._21_Lucene50_0.pay, recovery.tMp0C0AQSb-jFrqFryYFnA._21_Lucene50_0.pos, recovery.tMp0C0AQSb-jFrqFryYFnA._21_Lucene50_0.tim, recovery.tMp0C0AQSb-jFrqFryYFnA._21_Lucene50_0.tip, recovery.tMp0C0AQSb-jFrqFryYFnA._21_Lucene70_0.dvd, recovery.tMp0C0AQSb-jFrqFryYFnA._21_Lucene70_0.dvm, recovery.tMp0C0AQSb-jFrqFryYFnA._21_completion_0.cmp, recovery.tMp0C0AQSb-jFrqFryYFnA._21_completion_0.doc, recovery.tMp0C0AQSb-jFrqFryYFnA._21_completion_0.lkp, recovery.tMp0C0AQSb-jFrqFryYFnA._21_completion_0.pay, recovery.tMp0C0AQSb-jFrqFryYFnA._21_completion_0.pos, recovery.tMp0C0AQSb-jFrqFryYFnA._21_completion_0.tim, recovery.tMp0C0AQSb-jFrqFryYFnA._21_completion_0.tip, recovery.tMp0C0AQSb-jFrqFryYFnA._2c.dii, recovery.tMp0C0AQSb-jFrqFryYFnA._2c.dim, recovery.tMp0C0AQSb-jFrqFryYFnA._2c.fdx, recovery.tMp0C0AQSb-jFrqFryYFnA._2c.fnm, recovery.tMp0C0AQSb-jFrqFryYFnA._2c.nvd, recovery.tMp0C0AQSb-jFrqFryYFnA._2c.nvm, recovery.tMp0C0AQSb-jFrqFryYFnA._2c.si, recovery.tMp0C0AQSb-jFrqFryYFnA._2c_8.liv, recovery.tMp0C0AQSb-jFrqFryYFnA._2c_Lucene50_0.doc, recovery.tMp0C0AQSb-jFrqFryYFnA._2c_Lucene50_0.pay, recovery.tMp0C0AQSb-jFrqFryYFnA._2c_Lucene50_0.pos, recovery.tMp0C0AQSb-jFrqFryYFnA._2c_Lucene50_0.tim, recovery.tMp0C0AQSb-jFrqFryYFnA._2c_Lucene50_0.tip, recovery.tMp0C0AQSb-jFrqFryYFnA._2c_Lucene70_0.dvd, recovery.tMp0C0AQSb-jFrqFryYFnA._2c_Lucene70_0.dvm, recovery.tMp0C0AQSb-jFrqFryYFnA._2c_completion_0.cmp, recovery.tMp0C0AQSb-jFrqFryYFnA._2c_completion_0.doc, recovery.tMp0C0AQSb-jFrqFryYFnA._2c_completion_0.lkp, recovery.tMp0C0AQSb-jFrqFryYFnA._2c_completion_0.pay, recovery.tMp0C0AQSb-jFrqFryYFnA._2c_completion_0.pos, 
recovery.tMp0C0AQSb-jFrqFryYFnA._2c_completion_0.tim, recovery.tMp0C0AQSb-jFrqFryYFnA._2c_completion_0.tip, recovery.tMp0C0AQSb-jFrqFryYFnA._2l.cfe, recovery.tMp0C0AQSb-jFrqFryYFnA._2l.cfs, recovery.tMp0C0AQSb-jFrqFryYFnA._2l.si, recovery.tMp0C0AQSb-jFrqFryYFnA._2l_8.liv, recovery.tMp0C0AQSb-jFrqFryYFnA._2w.dii, recovery.tMp0C0AQSb-jFrqFryYFnA._2w.dim, recovery.tMp0C0AQSb-jFrqFryYFnA._2w.fdx, recovery.tMp0C0AQSb-jFrqFryYFnA._2w.fnm, recovery.tMp0C0AQSb-jFrqFryYFnA._2w.nvd, recovery.tMp0C0AQSb-jFrqFryYFnA._2w.nvm, recovery.tMp0C0AQSb-jFrqFryYFnA._2w.si, recovery.tMp0C0AQSb-jFrqFryYFnA._2w_Lucene50_0.doc, recovery.tMp0C0AQSb-jFrqFryYFnA._2w_Lucene50_0.pay, recovery.tMp0C0AQSb-jFrqFryYFnA._2w_Lucene50_0.pos, recovery.tMp0C0AQSb-jFrqFryYFnA._2w_Lucene50_0.tim, recovery.tMp0C0AQSb-jFrqFryYFnA._2w_Lucene50_0.tip, recovery.tMp0C0AQSb-jFrqFryYFnA._2w_Lucene70_0.dvd, recovery.tMp0C0AQSb-jFrqFryYFnA._2w_Lucene70_0.dvm, recovery.tMp0C0AQSb-jFrqFryYFnA._2w_a.liv, recovery.tMp0C0AQSb-jFrqFryYFnA._2w_completion_0.cmp, recovery.tMp0C0AQSb-jFrqFryYFnA._2w_completion_0.doc, recovery.tMp0C0AQSb-jFrqFryYFnA._2w_completion_0.lkp, recovery.tMp0C0AQSb-jFrqFryYFnA._2w_completion_0.pay, recovery.tMp0C0AQSb-jFrqFryYFnA._2w_completion_0.pos, recovery.tMp0C0AQSb-jFrqFryYFnA._2w_completion_0.tim, recovery.tMp0C0AQSb-jFrqFryYFnA._2w_completion_0.tip, recovery.tMp0C0AQSb-jFrqFryYFnA._2y.cfe, recovery.tMp0C0AQSb-jFrqFryYFnA._2y.cfs, recovery.tMp0C0AQSb-jFrqFryYFnA._2y.si, recovery.tMp0C0AQSb-jFrqFryYFnA._30.cfe, recovery.tMp0C0AQSb-jFrqFryYFnA._30.cfs, recovery.tMp0C0AQSb-jFrqFryYFnA._30.si, recovery.tMp0C0AQSb-jFrqFryYFnA._30_1.liv, recovery.tMp0C0AQSb-jFrqFryYFnA._35.cfe, recovery.tMp0C0AQSb-jFrqFryYFnA._35.cfs, recovery.tMp0C0AQSb-jFrqFryYFnA._35.si, recovery.tMp0C0AQSb-jFrqFryYFnA._35_2.liv, recovery.tMp0C0AQSb-jFrqFryYFnA._38.cfe, recovery.tMp0C0AQSb-jFrqFryYFnA._38.cfs, recovery.tMp0C0AQSb-jFrqFryYFnA._38.si, recovery.tMp0C0AQSb-jFrqFryYFnA._3b.cfe, recovery.tMp0C0AQSb-jFrqFryYFnA._3b.cfs, recovery.tMp0C0AQSb-jFrqFryYFnA._3b.si, recovery.tMp0C0AQSb-jFrqFryYFnA._3e.cfe, recovery.tMp0C0AQSb-jFrqFryYFnA._3e.cfs, recovery.tMp0C0AQSb-jFrqFryYFnA._3e.si, recovery.tMp0C0AQSb-jFrqFryYFnA._3f.cfe, recovery.tMp0C0AQSb-jFrqFryYFnA._3f.cfs, recovery.tMp0C0AQSb-jFrqFryYFnA._3f.si, recovery.tMp0C0AQSb-jFrqFryYFnA._3f_1.liv, recovery.tMp0C0AQSb-jFrqFryYFnA._3g.cfe, recovery.tMp0C0AQSb-jFrqFryYFnA._3g.cfs, recovery.tMp0C0AQSb-jFrqFryYFnA._3g.si, recovery.tMp0C0AQSb-jFrqFryYFnA._3h.cfe, recovery.tMp0C0AQSb-jFrqFryYFnA._3h.cfs, recovery.tMp0C0AQSb-jFrqFryYFnA._3h.si, recovery.tMp0C0AQSb-jFrqFryYFnA._3i.cfe, recovery.tMp0C0AQSb-jFrqFryYFnA._3i.cfs, recovery.tMp0C0AQSb-jFrqFryYFnA._3i.si, recovery.tMp0C0AQSb-jFrqFryYFnA._3j.cfe, recovery.tMp0C0AQSb-jFrqFryYFnA._3j.cfs, recovery.tMp0C0AQSb-jFrqFryYFnA._3j.si, recovery.tMp0C0AQSb-jFrqFryYFnA._3j_1.liv, recovery.tMp0C0AQSb-jFrqFryYFnA._3k.cfe, recovery.tMp0C0AQSb-jFrqFryYFnA._3k.cfs, recovery.tMp0C0AQSb-jFrqFryYFnA._3k.si, recovery.tMp0C0AQSb-jFrqFryYFnA._3l.cfe, recovery.tMp0C0AQSb-jFrqFryYFnA._3l.cfs, recovery.tMp0C0AQSb-jFrqFryYFnA._3l.si, recovery.tMp0C0AQSb-jFrqFryYFnA._3m.cfe, recovery.tMp0C0AQSb-jFrqFryYFnA._3m.cfs, recovery.tMp0C0AQSb-jFrqFryYFnA._3m.si, recovery.tMp0C0AQSb-jFrqFryYFnA._3n.cfe, recovery.tMp0C0AQSb-jFrqFryYFnA._3n.cfs, recovery.tMp0C0AQSb-jFrqFryYFnA._3n.si, recovery.tMp0C0AQSb-jFrqFryYFnA._3o.cfe, recovery.tMp0C0AQSb-jFrqFryYFnA._3o.cfs, recovery.tMp0C0AQSb-jFrqFryYFnA._3o.si, recovery.tMp0C0AQSb-jFrqFryYFnA._3p.dii, recovery.tMp0C0AQSb-jFrqFryYFnA._3p.dim, 
recovery.tMp0C0AQSb-jFrqFryYFnA._3p.fdt, recovery.tMp0C0AQSb-jFrqFryYFnA._3p.fdx, recovery.tMp0C0AQSb-jFrqFryYFnA._3p.fnm, recovery.tMp0C0AQSb-jFrqFryYFnA._3p.nvd, recovery.tMp0C0AQSb-jFrqFryYFnA._3p.nvm, recovery.tMp0C0AQSb-jFrqFryYFnA._3p.si, recovery.tMp0C0AQSb-jFrqFryYFnA._3p_1.liv, recovery.tMp0C0AQSb-jFrqFryYFnA._3p_Lucene50_0.doc, recovery.tMp0C0AQSb-jFrqFryYFnA._3p_Lucene50_0.pay, recovery.tMp0C0AQSb-jFrqFryYFnA._3p_Lucene50_0.pos, recovery.tMp0C0AQSb-jFrqFryYFnA._3p_Lucene50_0.tim, recovery.tMp0C0AQSb-jFrqFryYFnA._3p_Lucene50_0.tip, recovery.tMp0C0AQSb-jFrqFryYFnA._3p_Lucene70_0.dvd, recovery.tMp0C0AQSb-jFrqFryYFnA._3p_Lucene70_0.dvm, recovery.tMp0C0AQSb-jFrqFryYFnA._3p_completion_0.cmp, recovery.tMp0C0AQSb-jFrqFryYFnA._3p_completion_0.doc, recovery.tMp0C0AQSb-jFrqFryYFnA._3p_completion_0.lkp, recovery.tMp0C0AQSb-jFrqFryYFnA._3p_completion_0.pay, recovery.tMp0C0AQSb-jFrqFryYFnA._3p_completion_0.pos, recovery.tMp0C0AQSb-jFrqFryYFnA._3p_completion_0.tim, recovery.tMp0C0AQSb-jFrqFryYFnA._3p_completion_0.tip, recovery.tMp0C0AQSb-jFrqFryYFnA._3q.cfe, recovery.tMp0C0AQSb-jFrqFryYFnA._3q.cfs, recovery.tMp0C0AQSb-jFrqFryYFnA._3q.si, recovery.tMp0C0AQSb-jFrqFryYFnA._3r.cfe, recovery.tMp0C0AQSb-jFrqFryYFnA._3r.cfs, recovery.tMp0C0AQSb-jFrqFryYFnA._3r.si, recovery.tMp0C0AQSb-jFrqFryYFnA._v.dii, recovery.tMp0C0AQSb-jFrqFryYFnA._v.dim, recovery.tMp0C0AQSb-jFrqFryYFnA._v.fdx, recovery.tMp0C0AQSb-jFrqFryYFnA._v.fnm, recovery.tMp0C0AQSb-jFrqFryYFnA._v.nvd, recovery.tMp0C0AQSb-jFrqFryYFnA._v.nvm, recovery.tMp0C0AQSb-jFrqFryYFnA._v.si, recovery.tMp0C0AQSb-jFrqFryYFnA._v_8.liv, recovery.tMp0C0AQSb-jFrqFryYFnA._v_Lucene50_0.doc, recovery.tMp0C0AQSb-jFrqFryYFnA._v_Lucene50_0.pay, recovery.tMp0C0AQSb-jFrqFryYFnA._v_Lucene50_0.pos, recovery.tMp0C0AQSb-jFrqFryYFnA._v_Lucene50_0.tim, recovery.tMp0C0AQSb-jFrqFryYFnA._v_Lucene50_0.tip, recovery.tMp0C0AQSb-jFrqFryYFnA._v_Lucene70_0.dvd, recovery.tMp0C0AQSb-jFrqFryYFnA._v_Lucene70_0.dvm, recovery.tMp0C0AQSb-jFrqFryYFnA._v_completion_0.cmp, recovery.tMp0C0AQSb-jFrqFryYFnA._v_completion_0.doc, recovery.tMp0C0AQSb-jFrqFryYFnA._v_completion_0.lkp, recovery.tMp0C0AQSb-jFrqFryYFnA._v_completion_0.pay, recovery.tMp0C0AQSb-jFrqFryYFnA._v_completion_0.pos, recovery.tMp0C0AQSb-jFrqFryYFnA._v_completion_0.tim, recovery.tMp0C0AQSb-jFrqFryYFnA._v_completion_0.tip, recovery.tMp0C0AQSb-jFrqFryYFnA.segments_v, write.lock]
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:670) ~[lucene-core-7.3.1.jar:7.3.1 ae0705edb59eaa567fe13ed3a222fdadc7153680 - caomanhdat - 2018-05-09 09:27:24]
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:627) ~[lucene-core-7.3.1.jar:7.3.1 ae0705edb59eaa567fe13ed3a222fdadc7153680 - caomanhdat - 2018-05-09 09:27:24]
    at org.apache.lucene.index.SegmentInfos.readLatestCommit(SegmentInfos.java:434) ~[lucene-core-7.3.1.jar:7.3.1 ae0705edb59eaa567fe13ed3a222fdadc7153680 - caomanhdat - 2018-05-09 09:27:24]
    at org.elasticsearch.common.lucene.Lucene.readSegmentInfos(Lucene.java:122) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.index.store.Store.readSegmentsInfo(Store.java:207) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.index.store.Store.access$200(Store.java:134) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.index.store.Store$MetadataSnapshot.loadMetadata(Store.java:864) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.index.store.Store$MetadataSnapshot.<init>(Store.java:797) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.index.store.Store.getMetadata(Store.java:293) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.index.shard.IndexShard.snapshotStoreMetadata(IndexShard.java:1138) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.listStoreMetaData(TransportNodesListShardStoreMetaData.java:125) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:109) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:61) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:140) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:260) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:256) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:246) ~[?:?]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:304) ~[?:?]
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1592) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:724) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_162]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_162]
    ... 1 more
dliappis (Contributor) commented Nov 7, 2018

Hello @KeyZer,

Thanks for reaching out.

I'd like to understand better: is this a feature request (auto-expiration of old tasks) or a general question?

KeyZer (Author) commented Nov 7, 2018

I would classify it as a bug report. The cluster should never end up in a state where no changes can be made to the cluster state because a high-priority pending task has not completed for X days. The task probably got stuck because there was an issue creating a shard (see the log file), but the cluster kept waiting for the task to complete. Auto-expiration or heartbeats for tasks are a possible solution to the issue.

dliappis added the :Distributed/Task Management (Issues for anything around the Tasks API - both persistent and node level.) label and removed the feedback_needed label Nov 7, 2018
elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed

dliappis added the >bug label Nov 7, 2018
ywelsch added the feedback_needed and :Distributed/Cluster Coordination (Cluster formation and cluster state publication, including cluster membership and fault detection.) labels and removed the :Distributed/Task Management and >bug labels Nov 7, 2018
ywelsch (Contributor) commented Nov 7, 2018

The issue here has nothing to do with task priorities or cancelling / auto-expiring tasks. We execute the cluster state update tasks on a single-threaded executor, and the execution of one task here seems to have gotten stuck (see "executing": true), indefinitely blocking the executor's single thread. It would be interesting to find out where this thread is hanging. Can you provide hot_threads or jstack output from the master node?
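(For anyone needing to capture this later, something along these lines should do; the host, the threads parameter value and the use of jps/jstack on the master host are illustrative only:)

# Hot threads from the nodes, including the cluster-state update thread on the master
curl -s 'http://localhost:9200/_nodes/hot_threads?threads=9999' > hot_threads.txt

# Full JVM thread dump, taken on the master host itself; find the PID with jps
# (the main class is org.elasticsearch.bootstrap.Elasticsearch), <pid> is a placeholder
jps -l
jstack <pid> > master-jstack.txt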

KeyZer (Author) commented Nov 8, 2018

After we restarted the master node in the cluster, the pending_tasks cleared, so running hot_threads or jstack will not help now, unfortunately. I have all the Elasticsearch log files from that day if that helps.

ywelsch (Contributor) commented Nov 8, 2018

I don't expect the log files to help unless you see a warning of the form uncaught exception in thread or fatal error in thread in there. In case you observe the same issue again, please gather the above info. I'm afraid there's nothing else we can do here until then.
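(Those warnings can be grepped for across the logs roughly like this; /var/log/elasticsearch is the default log directory for a deb install and is an assumption here:)

grep -E 'uncaught exception in thread|fatal error in thread' /var/log/elasticsearch/*.log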

KeyZer (Author) commented Nov 8, 2018

There were no errors like that in the log file. If it happens again I will capture a stack dump of all threads.

iverase (Contributor) commented Dec 10, 2018

No further feedback received. @KeyZer, if you have the requested information, please add it in a comment and we can look at re-opening this issue.

iverase closed this as completed Dec 10, 2018
jkakavas added a commit that referenced this issue Dec 18, 2018
This change:

- Adds functionality to invalidate all (refresh+access) tokens for all users of a realm
- Adds functionality to invalidate all (refresh+access) tokens for a user in all realms
- Adds functionality to invalidate all (refresh+access) tokens for a user in a specific realm
- Changes the response format for the invalidate token API to contain information about the number of the invalidated tokens and possible errors that were encountered.
- Updates the API Documentation

Relates: #35338