Fix possible faulty metrics in TestFuncProducing* #1545
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Bugfix
Functional tests
TestFuncProducing*
can fail because of "faulty" metric recorded when one of Kafka brokers is not reachable.This happened when building #1539 as seen from the following logs (https://travis-ci.org/Shopify/sarama/jobs/612081678#L12183-L12217):
The root cause is likely that broker number 1 stopped somehow and was used as first seed broker in the failed test case:
Such a metadata request failed not because of a connection refused but because the TCP connection was closed prematurely because we are using toxiproxy.
I believe the
connect
andwrite
on the TCP socket worked but theread
failed immediately with no bytes read (meaning arequest-latency-in-ms
metric for that broker was recorded with 0 ms and aresponse-size metric
was recorded with 0 B).Because the cluster metrics include all requests including the one to
localhost:9091
the minimum latency and response size were recorded with0
instead of at least1
hence the failure.Versions
Changes
Do not check the
request-latency-in-ms
metric for the cluster as it could be 0 if one of the broker is down. Similarly theresponse-size
metric for the cluster could be recorded with 0 bytes.The respective broker metrics on the other hand are useful.
Also dump applicative logs from all brokers in case the tests (or something other build task) fail as it can be useful to identify the root cause.
Testing done
Start all brokers using Vagrant to run functional tests:
Stop broker number 2:
Use broker number 2 and 3 as seed brokers when running tests to reproduce the issue (might need to be run a few time for broker number 2 to be used as the first seed broker):
Same scenario with proposed patch:
And to test the dump of broker logs (note that
kafka-server-stop.sh
does not work properly because of KAFKA-4931 plus we have more than one broker running anyway):And in case of a successful build there are no broker logs: