Improve TaskBatcher performance in case of a datacenter failure #41406

incubos · 2019-04-22T10:49:03Z

We have a production cluster containing several hundred shards distributed between several datacenters. In case of a datacenter failure, it takes 10-15 minutes for the cluster to converge and resume normal functioning, which is not acceptable for us.
Stack trace analysis shows that all transport_server_worker.default threads are blocked on TaskBatcher.submitTasks() during that downtime interval:

"elasticsearch[XXX][[transport_server_worker.default]][T#72]" - Thread t@251
   java.lang.Thread.State: BLOCKED
        at org.elasticsearch.cluster.service.TaskBatcher.submitTasks(TaskBatcher.java:70)
        - waiting to lock <7d67b8b6> (a java.util.HashMap) owned by "elasticsearch[XXX][[transport_server_worker.default]][T#91]" t@270
        at org.elasticsearch.cluster.service.ClusterService.submitStateUpdateTasks(ClusterService.java:476)
        at org.elasticsearch.cluster.service.ClusterService.submitStateUpdateTask(ClusterService.java:450)
        at org.elasticsearch.cluster.action.shard.ShardStateAction$ShardFailedTransportHandler.messageReceived(ShardStateAction.java:211)
        at org.elasticsearch.cluster.action.shard.ShardStateAction$ShardFailedTransportHandler.messageReceived(ShardStateAction.java:197)
        at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1556)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at org.elasticsearch.common.util.concurrent.EsExecutors$1.execute(EsExecutors.java:110)
        at org.elasticsearch.transport.TcpTransport.handleRequest(TcpTransport.java:1513)
        at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1396)
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:75)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297)
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
        at java.lang.Thread.run(Thread.java:745)

At the same time, the TaskBatcher queue contains tens of thousands of tasks.
There are two major causes for the bottleneck:

1. TaskBatcher effectively serializes attempts to submit state update tasks because submitTasks() holds synchronized(tasksPerBatchingKey), the HashMap lock visible in the stack trace above
2. TaskBatcher compares each added task to all the existing tasks in O(n) to ensure no duplicate exists
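
To make the two points concrete, here is a minimal sketch of one way to avoid both of them. It is not the Elasticsearch TaskBatcher and not necessarily what any patch does: the class name BatchingQueueSketch, the Pending holder, and the submit()/drain() methods are all hypothetical. It assumes that tasks are considered duplicates by object identity and that it is enough to lock a single batching key at a time, which ConcurrentHashMap.compute() provides.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch only; names and structure are assumptions, not Elasticsearch code. */
final class BatchingQueueSketch {

    /** Tasks pending for one batching key: submission order plus an identity index. */
    private static final class Pending {
        final List<Object> ordered = new ArrayList<>();
        final Set<Object> identities = Collections.newSetFromMap(new IdentityHashMap<>());
    }

    // compute() runs atomically per key, so submissions for different
    // batching keys do not wait on one shared monitor.
    private final Map<Object, Pending> pendingPerKey = new ConcurrentHashMap<>();

    /** Queues tasks under the key; rejects a task instance that is already queued. */
    public void submit(Object batchingKey, List<?> tasks) {
        pendingPerKey.compute(batchingKey, (key, existing) -> {
            Pending pending = existing == null ? new Pending() : existing;
            // Validate everything before mutating, so a rejected submission
            // leaves the queue unchanged. Each lookup is an identity-hash
            // probe, so the cost depends on the number of submitted tasks,
            // not on the number of tasks already queued.
            Set<Object> incoming = Collections.newSetFromMap(new IdentityHashMap<>());
            for (Object task : tasks) {
                if (pending.identities.contains(task) || !incoming.add(task)) {
                    throw new IllegalStateException("task [" + task + "] is already queued");
                }
            }
            pending.ordered.addAll(tasks);
            pending.identities.addAll(tasks);
            return pending;
        });
    }

    /** Removes and returns all tasks queued under the key, in submission order. */
    public List<Object> drain(Object batchingKey) {
        Pending pending = pendingPerKey.remove(batchingKey);
        if (pending == null) {
            return new ArrayList<>();
        }
        return pending.ordered;
    }
}
```

In this sketch, submissions that share a batching key still take turns, which matters when a burst of requests (for example shard failures) all map to one key; what changes is that the time spent holding the lock no longer grows with the size of the queue.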

The problem can be easily reproduced using the v5.6.11 release. TaskBatcher remains unchanged in master.


elasticmachine · 2019-04-22T11:08:52Z

Pinging @elastic/es-distributed

ywelsch · 2019-04-23T06:47:06Z

This should be fixed in 6.4.0+ by #31313. The problem was that the nodes kept flooding the master with shard failure requests.

incubos · 2019-04-23T11:49:47Z

@ywelsch do you think #41407 is unnecessary or should we try 6.x first?

ywelsch · 2020-05-27T12:01:11Z

I'm closing this one, as it's not clear this is affecting any recent ES versions (6.4.0 and later).

incubos mentioned this issue Apr 22, 2019

Optimize TaskBatcher behavior in case of a datacenter failure. #41407

Closed

iverase added the :Distributed/Distributed label Apr 22, 2019

rjernst added the Team:Distributed label May 4, 2020

ywelsch closed this as completed May 27, 2020
