Fix possible faulty metrics in TestFuncProducing* #1545

slaunay · 2019-11-20T18:17:07Z

Bugfix

Functional tests TestFuncProducing* can fail because of "faulty" metric recorded when one of Kafka brokers is not reachable.

This happened when building #1539 as seen from the following logs (https://travis-ci.org/Shopify/sarama/jobs/612081678#L12183-L12217):

=== RUN   TestFuncProducingSnappy
[sarama] 2019/11/14 21:59:11 Initializing new client
[sarama] 2019/11/14 21:59:11 ClientID is the default of 'sarama', you should consider setting it to something application-specific.
[sarama] 2019/11/14 21:59:11 ClientID is the default of 'sarama', you should consider setting it to something application-specific.
[sarama] 2019/11/14 21:59:11 client/metadata fetching metadata for all topics from broker localhost:9091
INFO[0376] Accepted client                               client=127.0.0.1:49120 name=kafka1 proxy=[::]:9091 upstream=localhost:29091
ERRO[0376] Unable to open connection to upstream         client=127.0.0.1:49120 name=kafka1 proxy=[::]:9091 upstream=localhost:29091
[sarama] 2019/11/14 21:59:11 Connected to broker at localhost:9091 (unregistered)
[sarama] 2019/11/14 21:59:11 client/metadata got error from broker -1 while fetching metadata: EOF
[sarama] 2019/11/14 21:59:11 Closed connection to broker localhost:9091
[sarama] 2019/11/14 21:59:11 ClientID is the default of 'sarama', you should consider setting it to something application-specific.
[sarama] 2019/11/14 21:59:11 client/metadata fetching metadata for all topics from broker localhost:9092
INFO[0376] Accepted client                               client=127.0.0.1:32824 name=kafka2 proxy=[::]:9092 upstream=localhost:29092
[sarama] 2019/11/14 21:59:11 Connected to broker at localhost:9092 (unregistered)
[sarama] 2019/11/14 21:59:11 client/brokers registered new broker #9093 at localhost:9093
[sarama] 2019/11/14 21:59:11 client/brokers registered new broker #9092 at localhost:9092
[sarama] 2019/11/14 21:59:11 client/brokers registered new broker #9094 at localhost:9094
[sarama] 2019/11/14 21:59:11 client/brokers registered new broker #9095 at localhost:9095
[sarama] 2019/11/14 21:59:11 Successfully initialized new client
[sarama] 2019/11/14 21:59:11 ClientID is the default of 'sarama', you should consider setting it to something application-specific.
INFO[0376] Accepted client                               client=127.0.0.1:48224 name=kafka5 proxy=[::]:9095 upstream=localhost:29095
[sarama] 2019/11/14 21:59:11 Connected to broker at localhost:9095 (registered as #9095)
[sarama] 2019/11/14 21:59:11 producer/broker/9095 starting up
[sarama] 2019/11/14 21:59:11 producer/broker/9095 state change to [open] on test.1/0
[sarama] 2019/11/14 21:59:12 Producer shutting down.
[sarama] 2019/11/14 21:59:12 producer/broker/9095 input chan closed
[sarama] 2019/11/14 21:59:12 producer/broker/9095 shut down
[sarama] 2019/11/14 21:59:12 consumer/broker/9095 added subscription to test.1/0
[sarama] 2019/11/14 21:59:12 consumer/broker/9095 closed dead subscription to test.1/0
[sarama] 2019/11/14 21:59:12 Closing Client
WARN[0377] Source terminated                             bytes=18356 err=read tcp 127.0.0.1:49024->127.0.0.1:29095: use of closed network connection name=kafka5
[sarama] 2019/11/14 21:59:12 Closed connection to broker localhost:9095
[sarama] 2019/11/14 21:59:12 Closed connection to broker localhost:9092
WARN[0377] Source terminated                             bytes=7360 err=read tcp 127.0.0.1:59590->127.0.0.1:29092: use of closed network connection name=kafka2
--- FAIL: TestFuncProducingSnappy (1.03s)
    metrics_test.go:160: Expected histogram metric 'request-latency-in-ms' min >= 10, got 0
    metrics_test.go:160: Expected histogram metric 'response-size' min >= 1, got 0

The root cause is likely that broker number 1 stopped somehow and was used as first seed broker in the failed test case:

[sarama] 2019/11/14 21:59:11 client/metadata fetching metadata for all topics from broker localhost:9091
INFO[0376] Accepted client                               client=127.0.0.1:49120 name=kafka1 proxy=[::]:9091 upstream=localhost:29091
ERRO[0376] Unable to open connection to upstream         client=127.0.0.1:49120 name=kafka1 proxy=[::]:9091 upstream=localhost:29091
[sarama] 2019/11/14 21:59:11 Connected to broker at localhost:9091 (unregistered)
[sarama] 2019/11/14 21:59:11 client/metadata got error from broker -1 while fetching metadata: EOF
[sarama] 2019/11/14 21:59:11 Closed connection to broker localhost:9091

Such a metadata request failed not because of a connection refused but because the TCP connection was closed prematurely because we are using toxiproxy.
I believe the connect and write on the TCP socket worked but the read failed immediately with no bytes read (meaning a request-latency-in-ms metric for that broker was recorded with 0 ms and a response-size metric was recorded with 0 B).
Because the cluster metrics include all requests including the one to localhost:9091 the minimum latency and response size were recorded with 0 instead of at least 1 hence the failure.

Versions

Sarama	Kafka	Go
`bb74e49`	N/A	1.13

Changes

Do not check the request-latency-in-ms metric for the cluster as it could be 0 if one of the broker is down. Similarly the response-size metric for the cluster could be recorded with 0 bytes.
The respective broker metrics on the other hand are useful.

Also dump applicative logs from all brokers in case the tests (or something other build task) fail as it can be useful to identify the root cause.

Testing done

Start all brokers using Vagrant to run functional tests:

$ vagrant up
Bringing machine 'default' up with 'virtualbox' provider...
...
    default: finished produceWithDifferentProducers
    default:
    default: simplest uncommitted message producer finished

Stop broker number 2:

$ vagrant ssh -c "pgrep -f kafkaServer.*kafka-9092"
11893
Connection to 127.0.0.1 closed.
$ vagrant ssh -c "sudo pkill -f kafkaServer.*kafka-9092"
Connection to 127.0.0.1 closed.
$ vagrant ssh -c "pgrep -f kafkaServer.*kafka-9092"
Connection to 127.0.0.1 closed.

Use broker number 2 and 3 as seed brokers when running tests to reproduce the issue (might need to be run a few time for broker number 2 to be used as the first seed broker):

$ KAFKA_PEERS=192.168.100.67:9092,192.168.100.67:9093 DEBUG=true go test -run=TestFuncProducingSnappy -v
[sarama] 2019/11/20 10:00:55 Using random seed: 1574272855832395000
=== RUN   TestFuncProducingSnappy
[sarama] 2019/11/20 10:00:55 Initializing new client
[sarama] 2019/11/20 10:00:55 ClientID is the default of 'sarama', you should consider setting it to something application-specific.
[sarama] 2019/11/20 10:00:55 ClientID is the default of 'sarama', you should consider setting it to something application-specific.
[sarama] 2019/11/20 10:00:55 client/metadata fetching metadata for all topics from broker 192.168.100.67:9092
[sarama] 2019/11/20 10:00:55 Connected to broker at 192.168.100.67:9092 (unregistered)
[sarama] 2019/11/20 10:00:55 client/metadata got error from broker -1 while fetching metadata: read tcp 192.168.100.1:62021->192.168.100.67:9092: read: connection reset by peer
[sarama] 2019/11/20 10:00:55 Closed connection to broker 192.168.100.67:9092
[sarama] 2019/11/20 10:00:55 ClientID is the default of 'sarama', you should consider setting it to something application-specific.
[sarama] 2019/11/20 10:00:55 client/metadata fetching metadata for all topics from broker 192.168.100.67:9093
[sarama] 2019/11/20 10:00:55 Connected to broker at 192.168.100.67:9093 (unregistered)
[sarama] 2019/11/20 10:00:55 client/brokers registered new broker #9091 at 192.168.100.67:9091
[sarama] 2019/11/20 10:00:55 client/brokers registered new broker #9094 at 192.168.100.67:9094
[sarama] 2019/11/20 10:00:55 client/brokers registered new broker #9093 at 192.168.100.67:9093
[sarama] 2019/11/20 10:00:55 client/brokers registered new broker #9095 at 192.168.100.67:9095
[sarama] 2019/11/20 10:00:55 Successfully initialized new client
[sarama] 2019/11/20 10:00:55 ClientID is the default of 'sarama', you should consider setting it to something application-specific.
[sarama] 2019/11/20 10:00:55 Connected to broker at 192.168.100.67:9094 (registered as #9094)
[sarama] 2019/11/20 10:00:55 producer/broker/9094 starting up
[sarama] 2019/11/20 10:00:55 producer/broker/9094 state change to [open] on test.1/0
[sarama] 2019/11/20 10:00:55 Producer shutting down.
[sarama] 2019/11/20 10:00:55 producer/broker/9094 input chan closed
[sarama] 2019/11/20 10:00:55 producer/broker/9094 shut down
[sarama] 2019/11/20 10:00:55 consumer/broker/9094 added subscription to test.1/0
[sarama] 2019/11/20 10:00:56 consumer/broker/9094 closed dead subscription to test.1/0
[sarama] 2019/11/20 10:00:56 Closing Client
[sarama] 2019/11/20 10:00:56 Closed connection to broker 192.168.100.67:9094
[sarama] 2019/11/20 10:00:56 Closed connection to broker 192.168.100.67:9093
--- FAIL: TestFuncProducingSnappy (0.39s)
    metrics_test.go:160: Expected histogram metric 'request-latency-in-ms' min >= 10, got 0
    metrics_test.go:160: Expected histogram metric 'response-size' min >= 1, got 0
FAIL
exit status 1
FAIL    github.com/Shopify/sarama       0.539s

Same scenario with proposed patch:

$ KAFKA_PEERS=192.168.100.67:9092,192.168.100.67:9093 DEBUG=true go test -run=TestFuncProducingSnappy -v
[sarama] 2019/11/20 10:02:33 Using random seed: 1574272953410242000
=== RUN   TestFuncProducingSnappy
[sarama] 2019/11/20 10:02:33 Initializing new client
[sarama] 2019/11/20 10:02:33 ClientID is the default of 'sarama', you should consider setting it to something application-specific.
[sarama] 2019/11/20 10:02:33 ClientID is the default of 'sarama', you should consider setting it to something application-specific.
[sarama] 2019/11/20 10:02:33 client/metadata fetching metadata for all topics from broker 192.168.100.67:9092
[sarama] 2019/11/20 10:02:33 Connected to broker at 192.168.100.67:9092 (unregistered)
[sarama] 2019/11/20 10:02:33 client/metadata got error from broker -1 while fetching metadata: read tcp 192.168.100.1:62064->192.168.100.67:9092: read: connection reset by peer
[sarama] 2019/11/20 10:02:33 Closed connection to broker 192.168.100.67:9092
[sarama] 2019/11/20 10:02:33 ClientID is the default of 'sarama', you should consider setting it to something application-specific.
[sarama] 2019/11/20 10:02:33 client/metadata fetching metadata for all topics from broker 192.168.100.67:9093
[sarama] 2019/11/20 10:02:33 Connected to broker at 192.168.100.67:9093 (unregistered)
[sarama] 2019/11/20 10:02:33 client/brokers registered new broker #9091 at 192.168.100.67:9091
[sarama] 2019/11/20 10:02:33 client/brokers registered new broker #9094 at 192.168.100.67:9094
[sarama] 2019/11/20 10:02:33 client/brokers registered new broker #9093 at 192.168.100.67:9093
[sarama] 2019/11/20 10:02:33 client/brokers registered new broker #9095 at 192.168.100.67:9095
[sarama] 2019/11/20 10:02:33 Successfully initialized new client
[sarama] 2019/11/20 10:02:33 ClientID is the default of 'sarama', you should consider setting it to something application-specific.
[sarama] 2019/11/20 10:02:33 Connected to broker at 192.168.100.67:9094 (registered as #9094)
[sarama] 2019/11/20 10:02:33 producer/broker/9094 starting up
[sarama] 2019/11/20 10:02:33 producer/broker/9094 state change to [open] on test.1/0
[sarama] 2019/11/20 10:02:33 Producer shutting down.
[sarama] 2019/11/20 10:02:33 producer/broker/9094 input chan closed
[sarama] 2019/11/20 10:02:33 producer/broker/9094 shut down
[sarama] 2019/11/20 10:02:33 consumer/broker/9094 added subscription to test.1/0
[sarama] 2019/11/20 10:02:33 consumer/broker/9094 closed dead subscription to test.1/0
[sarama] 2019/11/20 10:02:33 Closing Client
[sarama] 2019/11/20 10:02:33 Closed connection to broker 192.168.100.67:9094
[sarama] 2019/11/20 10:02:33 Closed connection to broker 192.168.100.67:9093
--- PASS: TestFuncProducingSnappy (0.40s)
PASS
ok      github.com/Shopify/sarama       0.477s

And to test the dump of broker logs (note that kafka-server-stop.sh does not work properly because of KAFKA-4931 plus we have more than one broker running anyway):

$ vagrant ssh -c "sudo TRAVIS_TEST_RESULT=1 KAFKA_INSTALL_ROOT=/opt /vagrant/vagrant/halt_cluster.sh"
Dumping Kafka brokers server.log:
kafka-9091 [2019-11-19 22:43:16,810] WARN Client session timed out, have not heard from server in 7726650ms for sessionid 0x16e85405ddd0000 (org.apache.zookeeper.ClientCnxn)
kafka-9091 [2019-11-19 22:43:16,813] INFO Client session timed out, have not heard from server in 7726650ms for sessionid 0x16e85405ddd0000, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
...
kafka-9095 [2019-11-20 18:03:52,000] INFO Expiring session 0x56e854052720003, timeout of 4000ms exceeded (org.apache.zookeeper.server.ZooKeeperServer)
kafka-9095 [2019-11-20 18:03:52,001] INFO Processed session termination for sessionid: 0x56e854052720003 (org.apache.zookeeper.server.PrepRequestProcessor)
+ for i in 1 2 3 4 5
++ expr 1 + 9090
+ KAFKA_PORT=9091
+ cd /opt/kafka-9091
+ bin/kafka-server-stop.sh
No kafka server to stop
Connection to 127.0.0.1 closed.

And in case of a successful build there are no broker logs:

$ vagrant ssh -c "sudo TRAVIS_TEST_RESULT=0 KAFKA_INSTALL_ROOT=/opt /vagrant/vagrant/halt_clust
er.sh"
+ for i in 1 2 3 4 5
++ expr 1 + 9090
+ KAFKA_PORT=9091
+ cd /opt/kafka-9091
+ bin/kafka-server-stop.sh
No kafka server to stop
Connection to 127.0.0.1 closed.

Do not check the request-latency-in-ms metric for the cluster as it could be 0 if one of the broker is down. Similarly the response-size metric for the cluster could be recorded with 0 bytes. The respective broker metrics on the other hand are useful. Dump applicative logs from all brokers in case the tests (or something other build task) fail.

This reverts commit 49a5636.

bai requested a review from d1egoaz November 21, 2019 13:29

varun06 approved these changes Nov 22, 2019

View reviewed changes

varun06 merged commit 49a5636 into IBM:master Nov 22, 2019

souravdotsaha added a commit to souravdotsaha/sarama that referenced this pull request Nov 22, 2019

Revert "Fix possible faulty metrics in TestFuncProducing* (IBM#1545)"

b3ef550

This reverts commit 49a5636.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix possible faulty metrics in TestFuncProducing* #1545

Fix possible faulty metrics in TestFuncProducing* #1545

slaunay commented Nov 20, 2019

Fix possible faulty metrics in TestFuncProducing* #1545

Fix possible faulty metrics in TestFuncProducing* #1545

Conversation

slaunay commented Nov 20, 2019

Bugfix

Versions

Changes

Testing done