
Add heartbeat to detect down slaves #927

Merged: 24 commits into locustio:master on Feb 6, 2019

Conversation

@Jonnymcc (Contributor) commented Dec 4, 2018

I have run into some issues bringing up master and slave nodes within Kubernetes. Some other GitHub issues for Locust suggest that several of them could be solved by adding a heartbeat between the nodes.

The changes add a heartbeat thread on the master. The slaves check in with the master by means of their own heartbeat worker thread. There is also a new state (missing) that shows in the UI if a slave fails to check in with the master.

I also think this prevents some stale state, since the slaves send their current state to the master with each heartbeat. If a slave fails to check in and goes missing, this is how the master learns what the slave is doing once it resumes checking in.

I got my inspiration from the ZMQ guide, specifically this pattern...
http://zguide.zeromq.org/page:all#Robust-Reliable-Queuing-Paranoid-Pirate-Pattern

Thoughts, comments?
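For illustration, here is a minimal sketch of the slave-side loop described above, assuming a gevent-based worker like Locust's other runner greenlets. The names (`HEARTBEAT_INTERVAL`, `node_id`, `send`, `get_state`) are made up for the sketch, not the PR's exact identifiers:

```python
import gevent

HEARTBEAT_INTERVAL = 1  # seconds between slave check-ins (illustrative value)

def heartbeat_worker(send, node_id, get_state):
    """Slave-side greenlet: ping the master and relay current state."""
    while True:
        # Including the state keeps the master in sync even after this
        # slave has been flagged missing and then reappears.
        send({"type": "heartbeat", "node_id": node_id, "state": get_state()})
        gevent.sleep(HEARTBEAT_INTERVAL)

# e.g. gevent.spawn(heartbeat_worker, client.send, "slave-1", lambda: "ready")
```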

The PUSH and PULL sockets being used caused hatch messages to get routed to slaves that may have become unresponsive or crashed. This change includes the client id in the messages sent out from the master, which ensures that hatch messages go only to slaves that are READY or RUNNING.

This should also fix issue locustio#911, where slaves are not receiving the stop message. I think these issues are a result of PUSH-PULL sockets using a round-robin approach.
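For context, this is roughly what identity-addressed delivery looks like in pyzmq with ROUTER/DEALER sockets, as opposed to round-robin PUSH/PULL. The port and identity values here are made up for the sketch:

```python
import zmq

ctx = zmq.Context()

# Master: a ROUTER socket can address a specific peer by its identity frame.
master = ctx.socket(zmq.ROUTER)
master.bind("tcp://*:5557")

# Slave: a DEALER socket with an explicit identity.
slave = ctx.socket(zmq.DEALER)
slave.setsockopt(zmq.IDENTITY, b"slave-1")
slave.connect("tcp://localhost:5557")

# The slave announces itself so the master learns its identity...
slave.send(b"client_ready")
identity, payload = master.recv_multipart()

# ...and a hatch/stop message can then go to that one slave only,
# instead of being round-robined across all connected slaves.
master.send_multipart([identity, b"hatch"])
assert slave.recv() == b"hatch"
```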
Jonathan McCall added 4 commits December 9, 2018 21:12
The server checks to see if clients have expired and, if they have, updates their status to "missing".

The client has a worker that sends a heartbeat on a regular interval. The heartbeat also relays the slave state back to the master so that the two stay in sync.

Wait until all slaves are reporting in as ready before stating that the master is stopped.
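The master-side expiry check described in these commits might look something like the following sketch. `HEARTBEAT_LIVENESS` and the `clients` dict are assumptions for illustration, not the PR's actual fields:

```python
import time

HEARTBEAT_INTERVAL = 1   # seconds, matching the slave-side loop
HEARTBEAT_LIVENESS = 3   # missed intervals before a slave is declared missing

clients = {}  # node_id -> {"heartbeat_deadline": float, "state": str}

def on_heartbeat(node_id, state):
    # Each check-in pushes the deadline forward and refreshes the state.
    clients[node_id] = {
        "heartbeat_deadline": time.time() + HEARTBEAT_LIVENESS * HEARTBEAT_INTERVAL,
        "state": state,
    }

def expire_clients():
    """Run periodically on the master; flag overdue slaves as missing."""
    now = time.time()
    for info in clients.values():
        if info["heartbeat_deadline"] < now:
            info["state"] = "missing"
```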
@codecov bot commented Dec 10, 2018

Codecov Report

Merging #927 into master will increase coverage by 0.97%.
The diff coverage is 61.01%.


@@            Coverage Diff             @@
##           master     #927      +/-   ##
==========================================
+ Coverage   66.55%   67.52%   +0.97%     
==========================================
  Files          14       14              
  Lines        1438     1472      +34     
  Branches      226      230       +4     
==========================================
+ Hits          957      994      +37     
+ Misses        430      429       -1     
+ Partials       51       49       -2
Impacted Files Coverage Δ
locust/rpc/zmqrpc.py 100% <100%> (+59.09%) ⬆️
locust/runners.py 51.5% <41.02%> (+0.19%) ⬆️
locust/core.py 85.64% <0%> (+1.38%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0a155ed...25272c6.

Jonathan McCall added 10 commits December 11, 2018 10:27
I think this looks better than using msg[1].
Using parse_options during test setup can conflict with test runners like pytest. Essentially it swallows the options that are meant to be passed to the test runner and instead treats them as options passed to the test.
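One common way to avoid that clash, sketched here as an assumption rather than the exact change in this PR, is to build the options object by hand in test setup instead of calling parse_options():

```python
import unittest

class MockedOptions(object):
    # Only the attributes the code under test reads; values are examples.
    master_host = "127.0.0.1"
    master_port = 5557
    heartbeat_interval = 1

class RunnerTestCase(unittest.TestCase):
    def setUp(self):
        # Constructing options directly means sys.argv (and therefore
        # pytest's own flags) is never touched by locust's parser.
        self.options = MockedOptions()
```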
Coverage breaks with gevent and does not fully report green threads
as having been tested. Setting concurrency in .coveragerc will
fix the issue. https://bitbucket.org/ned/coveragepy/issues/149/coverage-gevent-looks-broken
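For reference, the coverage.py setting the commit refers to goes in the [run] section of .coveragerc:

```ini
# .coveragerc
[run]
concurrency = gevent
```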
@myzhan (Contributor) commented Dec 27, 2018

Hi, @Jonnymcc

This is a very useful feature. So far, the master cannot detect slaves that exit unexpectedly.

But can you keep it compatible with previous versions of Locust and third-party tools?

What about binding to another port?

@Jonnymcc (Contributor, Author)

@myzhan I think we can support previous versions. Let me know if I understand what is needed, though.

  1. Removing the second port (master_port + 1) is not important. If it is important (collecting stats from slaves), why can't the master bind port be used?
  2. The issue stems from the changes to the locust zmq server/client, specifically the addition of the multipart send and receive.
  3. Supporting third-party tools can be accomplished by maintaining the same send and receive methods of the zmq server/client and returning the Message type rather than the multiple frames (a list); see the sketch after this comment.

By compatible with previous versions, do you mean running a master/client against a master/client of another version? That sounds like asking for bugs regardless of the changes made in this PR.
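A rough sketch of what point 3 could look like, with Message handling simplified to raw bytes since the actual serialization isn't shown in this thread; the class and method names are illustrative:

```python
import zmq

class MasterServer(object):
    """Keeps the old single-return recv() contract on top of ROUTER."""

    def __init__(self, port=5557):
        ctx = zmq.Context()
        self.socket = ctx.socket(zmq.ROUTER)
        self.socket.bind("tcp://*:%d" % port)

    def recv(self):
        # ROUTER hands us [identity, payload]. Keep the identity for
        # addressed replies, but return a single payload as before so
        # third-party slaves see the interface they already speak.
        identity, payload = self.socket.recv_multipart()
        self.last_identity = identity
        return payload  # in locust this would be a deserialized Message

    def send_to(self, identity, payload):
        # Addressed delivery: only the named slave receives the message.
        self.socket.send_multipart([identity, payload])
```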

@cgoldberg cgoldberg merged commit a8c0d7d into locustio:master Feb 6, 2019
@cgoldberg (Member) commented

@Jonnymcc 👍 this is really nice!
sorry for taking so long to review the PR... I just merged it to master.
thanks for the contribution to Locust!

@mangatmodi commented

@cgoldberg Any plans for a release? The last release was in September 2018. This is quite a major fix; do you think we should have a release now?

@aldenpeterson-wf (Contributor) commented

@mangatmodi this is now released 🎉

@tsykora-verimatrix commented

I guess the fix to stop the test after hitting the stop button in the UI did not make it into the 0.11.0 release.
