Improve TaskBatcher performance in case of a datacenter failure #41406
Labels
:Distributed/Distributed
A catch all label for anything in the Distributed Area. If you aren't sure, use this one.
Team:Distributed
Meta label for distributed team
We have a production cluster containing several hundred shards distributed between several datacenters. In case of a datacenter failure it takes 10-15 min for the cluster to converge and resume normal functioning, which is not quite acceptable.
Stack trace analysis shows that all
transport_server_worker.default
threads are blocked onTaskBatcher.submitTasks()
during that downtime interval:At the same time
TaskBatcher
queue contains tens of thousands tasks.There are two major causes for the bottleneck:
TaskBatcher
effectively serializes attempts to submit state update tasks due to use ofserialized(tasksPerBatchingKey)
TaskBatcher
compares each added task to all the existing tasks inO(n)
to ensure no duplicate existsThe problem can be easily reproduced using
v5.6.11
release.TaskBatcher
remains unchanged inmaster
.The text was updated successfully, but these errors were encountered: