
Add heartbeat to detect down slaves #927

Merged: 24 commits into locustio:master on Feb 6, 2019

Conversation

@Jonnymcc (Contributor) commented Dec 4, 2018

I have run into some issues bringing up master and slave nodes within Kubernetes. Some other GitHub issues for Locust suggest that several of them could be solved by adding a heartbeat between the nodes.

The changes add a heartbeat thread on the master. The slaves check in with the master by means of their own heartbeat worker thread. There is also a new state (missing) that shows in the UI if a slave fails to check in with the master.

I also think this prevents some stale state, since the slaves send their current state to the master with each heartbeat. If a slave fails to check in and goes missing, this is how the master learns what the slave is doing once it resumes checking in.

I got my inspiration from the ZMQ guide, specifically this pattern...
http://zguide.zeromq.org/page:all#Robust-Reliable-Queuing-Paranoid-Pirate-Pattern

Thoughts, comments?
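For illustration, here is a minimal sketch of the slave-side loop described above, assuming a gevent-based worker like Locust's other runner greenlets. The names (`HEARTBEAT_INTERVAL`, `node_id`, `send`, `get_state`) are made up for the sketch, not the PR's exact identifiers:

```python
import gevent

HEARTBEAT_INTERVAL = 1  # seconds between slave check-ins (illustrative value)

def heartbeat_worker(send, node_id, get_state):
    """Slave-side greenlet: ping the master and relay current state."""
    while True:
        # Including the state keeps the master in sync even after this
        # slave has been flagged missing and then reappears.
        send({"type": "heartbeat", "node_id": node_id, "state": get_state()})
        gevent.sleep(HEARTBEAT_INTERVAL)

# e.g. gevent.spawn(heartbeat_worker, client.send, "slave-1", lambda: "ready")
```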

The PUSH and PULL sockets being used caused hatch messages to get routed to slaves that may have become unresponsive or crashed. This change includes the client id in the messages sent out from the master, which ensures that hatch messages go only to slaves that are READY or RUNNING.

This should also fix issue locustio#911, where slaves are not receiving the stop message. I think these issues are a result of PUSH-PULL sockets using a round-robin approach.
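For context, this is roughly what identity-addressed delivery looks like in pyzmq with ROUTER/DEALER sockets, as opposed to round-robin PUSH/PULL. The port and identity values here are made up for the sketch:

```python
import zmq

ctx = zmq.Context()

# Master: a ROUTER socket can address a specific peer by its identity frame.
master = ctx.socket(zmq.ROUTER)
master.bind("tcp://*:5557")

# Slave: a DEALER socket with an explicit identity.
slave = ctx.socket(zmq.DEALER)
slave.setsockopt(zmq.IDENTITY, b"slave-1")
slave.connect("tcp://localhost:5557")

# The slave announces itself so the master learns its identity...
slave.send(b"client_ready")
identity, payload = master.recv_multipart()

# ...and a hatch/stop message can then go to that one slave only,
# instead of being round-robined across all connected slaves.
master.send_multipart([identity, b"hatch"])
assert slave.recv() == b"hatch"
```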
Jonathan McCall added 4 commits December 9, 2018 21:12
The server checks to see if clients have expired and, if they have, updates their status to "missing".

The client has a worker that sends a heartbeat on a regular interval. The heartbeat also relays the slave state back to the master so that the two stay in sync.

Wait until all slaves are reporting in as ready before stating that the master is stopped.
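The master-side expiry check described in these commits might look something like the following sketch. `HEARTBEAT_LIVENESS` and the `clients` dict are assumptions for illustration, not the PR's actual fields:

```python
import time

HEARTBEAT_INTERVAL = 1   # seconds, matching the slave-side loop
HEARTBEAT_LIVENESS = 3   # missed intervals before a slave is declared missing

clients = {}  # node_id -> {"heartbeat_deadline": float, "state": str}

def on_heartbeat(node_id, state):
    # Each check-in pushes the deadline forward and refreshes the state.
    clients[node_id] = {
        "heartbeat_deadline": time.time() + HEARTBEAT_LIVENESS * HEARTBEAT_INTERVAL,
        "state": state,
    }

def expire_clients():
    """Run periodically on the master; flag overdue slaves as missing."""
    now = time.time()
    for info in clients.values():
        if info["heartbeat_deadline"] < now:
            info["state"] = "missing"
```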
@codecov bot commented Dec 10, 2018

Codecov Report

Merging #927 into master will increase coverage by 0.97%.
The diff coverage is 61.01%.


@@            Coverage Diff             @@
##           master     #927      +/-   ##
==========================================
+ Coverage   66.55%   67.52%   +0.97%     
==========================================
  Files          14       14              
  Lines        1438     1472      +34     
  Branches      226      230       +4     
==========================================
+ Hits          957      994      +37     
+ Misses        430      429       -1     
+ Partials       51       49       -2
Impacted Files Coverage Δ
locust/rpc/zmqrpc.py 100% <100%> (+59.09%) ⬆️
locust/runners.py 51.5% <41.02%> (+0.19%) ⬆️
locust/core.py 85.64% <0%> (+1.38%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0a155ed...25272c6.

Jonathan McCall added 10 commits December 11, 2018 10:27
I think this looks better than using msg[1].
Using parse_options during test setup can conflict with test runners like pytest. Essentially it swallows the options that are meant to be passed to the test runner and instead treats them as options passed to the test.
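One common way to avoid that clash, sketched here as an assumption rather than the exact change in this PR, is to build the options object by hand in test setup instead of calling parse_options():

```python
import unittest

class MockedOptions(object):
    # Only the attributes the code under test reads; values are examples.
    master_host = "127.0.0.1"
    master_port = 5557
    heartbeat_interval = 1

class RunnerTestCase(unittest.TestCase):
    def setUp(self):
        # Constructing options directly means sys.argv (and therefore
        # pytest's own flags) is never touched by locust's parser.
        self.options = MockedOptions()
```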
Coverage breaks with gevent and does not fully report green threads
as having been tested. Setting concurrency in .coveragerc will
fix the issue. https://bitbucket.org/ned/coveragepy/issues/149/coverage-gevent-looks-broken
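For reference, the coverage.py setting the commit refers to goes in the [run] section of .coveragerc:

```ini
# .coveragerc
[run]
concurrency = gevent
```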
@myzhan (Contributor) commented Dec 27, 2018

Hi, @Jonnymcc

This is a very useful feature. So far, the master cannot detect slaves that exit unexpectedly.

But can you keep it compatible with previous versions of Locust and third-party tools?

What about binding to another port?

@Jonnymcc (Contributor, Author)

@myzhan I think we can support previous versions. Let me know if I understand what is needed, though.

  1. Removing the second port (master_port + 1) is not important. If it is important (collecting stats from slaves), why can't the master bind port be used?
  2. The issue stems from the changes to the locust zmq server/client, specifically the addition of the multipart send and receive.
  3. Supporting third-party tools can be accomplished by maintaining the same send and receive methods of the zmq server/client and returning the Message type rather than the multiple frames (a list); see the sketch after this comment.

By compatible with previous versions, do you mean running a master/client against a master/client of another version? That sounds like asking for bugs regardless of the changes made in this PR.
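A rough sketch of what point 3 could look like, with Message handling simplified to raw bytes since the actual serialization isn't shown in this thread; the class and method names are illustrative:

```python
import zmq

class MasterServer(object):
    """Keeps the old single-return recv() contract on top of ROUTER."""

    def __init__(self, port=5557):
        ctx = zmq.Context()
        self.socket = ctx.socket(zmq.ROUTER)
        self.socket.bind("tcp://*:%d" % port)

    def recv(self):
        # ROUTER hands us [identity, payload]. Keep the identity for
        # addressed replies, but return a single payload as before so
        # third-party slaves see the interface they already speak.
        identity, payload = self.socket.recv_multipart()
        self.last_identity = identity
        return payload  # in locust this would be a deserialized Message

    def send_to(self, identity, payload):
        # Addressed delivery: only the named slave receives the message.
        self.socket.send_multipart([identity, payload])
```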

@cgoldberg cgoldberg merged commit a8c0d7d into locustio:master Feb 6, 2019
@cgoldberg (Member) commented

@Jonnymcc 👍 this is really nice!
sorry for taking so long to review the PR... I just merged it to master.
thanks for the contribution to Locust!

@mangatmodi commented

@cgoldberg Any plans for a release? The last release was in September 2018. This is quite a major fix; do you think we should have a release now?

@aldenpeterson-wf (Contributor) commented

@mangatmodi this is now released 🎉

@tsykora-verimatrix commented

I guess the fix to stop the test after hitting the stop button in the UI did not make it into the 0.11.0 release.
