
feat: Implementation of heartbeat mechanism as part of KLIP-12 #4173

Merged

merged 2 commits into confluentinc:master on Jan 16, 2020

Conversation

vpapavas
Member

@vpapavas vpapavas commented Dec 19, 2019

Description

Initial implementation of heartbeat mechanism as part of #3959

Added two new endpoints, /heartbeat used for registering received heartbeats and /clusterStatus used for reporting current cluster status. All the logic is in HeartbeatHandler.

Determining whether a server is UP or DOWN
Currently, the policy for deciding whether a node is UP or DOWN is simplistic and will need to evolve: if a server misses more than KSQL_HEARTBEAT_MISSED_THRESHOLD_CONFIG consecutive heartbeats, it is marked as DOWN. However, if it receives even one heartbeat before the window closes, it is marked as UP. So a server may miss all heartbeats except the last and still be marked as UP. We want to improve this by also counting the received heartbeats and applying a threshold to them. We will update the policy after performing real-world tests.
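The policy described above can be sketched as follows. This is a simplified illustration under stated assumptions: the class and method names are invented, not KSQL's actual HeartbeatHandler API, and the per-gap reset behavior mirrors the description (a late heartbeat clears earlier misses):

```java
import java.util.NavigableSet;

// Hypothetical sketch of the "missed heartbeats" policy described above.
public class HeartbeatPolicySketch {
    // A server is DOWN if it missed more than `missedThreshold` consecutive
    // heartbeat intervals inside the window; one heartbeat received near the
    // window end is enough to be marked UP again (the known weakness above).
    public static boolean isHostUp(NavigableSet<Long> heartbeats,
                                   long windowStart, long windowEnd,
                                   long intervalMs, int missedThreshold) {
        long prev = windowStart;
        int missed;
        for (long ts : heartbeats.subSet(windowStart, true, windowEnd, true)) {
            // Gap between consecutive heartbeats, in whole intervals.
            missed = (int) ((ts - prev - 1) / intervalMs);
            if (missed > missedThreshold) {
                return false;              // too many consecutive misses
            }
            prev = ts;
        }
        // Check the frame from the last received heartbeat to the window end.
        missed = (int) ((windowEnd - prev - 1) / intervalMs);
        return missed <= missedThreshold;
    }
}
```

A host that heartbeats every interval stays UP; a host that is silent for more than `missedThreshold` intervals is reported DOWN.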

Cluster membership
Currently, cluster membership is determined via the KStreams metadata of the persistent queries.

Testing done

Unit tests for the resources and a functional test for the HeartbeatHandler.
The functional test runs two servers that send heartbeats and checks the cluster status response.

Manual testing: Three KSQL servers (A, B and C) on EC2.
Correctness:

  1. Kill server A and check that the other servers (B, C) report it as down via curl -sX GET "http://localhost:8088/clusterStatus". Bring server A back up and check the curl result again to verify that the other servers now see it as up.
  2. Use iptables to block outgoing traffic from A to B. Use curl to check that B marks A as down and that C still marks A as up (since A still sends traffic to C). Restore traffic to B and check that A is marked as up again.

Time to detect change in status:
300 ms to detect that A is down; 150 ms to detect that it is up again.

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@vpapavas vpapavas requested a review from a team as a code owner December 19, 2019 01:15
@vinothchandar vinothchandar self-assigned this Dec 19, 2019
Contributor

@agavra agavra left a comment


Thanks @vpapavas! I think the bones are there, but the complexity and synchronization models can be improved IMO. Let me know if you have any questions :)

HostInfo hostInfo = new HostInfo(hostInfoEntity.getHost(), hostInfoEntity.getPort());
long timestamp = request.getTimestamp();

heartbeatHandler.getReceivedHeartbeats().putIfAbsent(hostInfo, new TreeMap<>());
Contributor

it might make sense to split the producer/consumer data structures of heartbeats. I would have the heartbeat handler maintain a circular buffer of heartbeats that can have many writers and then whenever you runOneIteration (see comment about AbstractScheduledService) you would flush that buffer into the tree map locally. This would probably result in fewer tree rebalances and a much simpler concurrency model. It also makes this method non-blocking (in the case that heartbeats is locked)
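One way to read this suggestion is sketched below. This is a hedged illustration, not the PR's code: the names are invented, and a ConcurrentLinkedQueue stands in for the "circular buffer" of heartbeats:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map.Entry;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch: many request threads append heartbeats lock-free; a single
// scheduled thread drains them into an ordered TreeMap each iteration.
public class HeartbeatBufferSketch {
    private final ConcurrentLinkedQueue<Entry<String, Long>> buffer =
        new ConcurrentLinkedQueue<>();
    private final TreeMap<Long, String> ordered = new TreeMap<>();

    // Called from /heartbeat request threads: non-blocking append.
    public void registerHeartbeat(String host, long timestampMs) {
        buffer.add(new SimpleEntry<>(host, timestampMs));
    }

    // Called from the single processing thread (e.g. runOneIteration):
    // flush the buffer into the local, ordered view.
    public int flush() {
        int drained = 0;
        Entry<String, Long> e;
        while ((e = buffer.poll()) != null) {
            ordered.put(e.getValue(), e.getKey());
            drained++;
        }
        return drained;
    }

    public TreeMap<Long, String> orderedView() {
        return ordered;
    }
}
```

Because only one thread ever touches the TreeMap, no synchronization is needed on it, and the request path never blocks.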

Member Author

Yes, I had privately suggested all three design choices and we decided to go with the simplest and measure runtime/throughput and then go from there. The requirements to keep in mind are:

  1. The heartbeats per host need to be ordered for processing.
  2. We have multiple producers with out-of-order writes.
  3. We have one consumer.

The choices in order of (reasoning) complexity are:

  1. Synchronize with ConcurrentHashMap + TreeMap
  2. ConcurrentHashMap + ConcurrentNavigableMap
  3. Use read/write locks and unordered concurrent list implementation. It will get ordered and drained by processHeartbeat thread.

Contributor

@agavra agavra Dec 20, 2019

As discussed offline - I agree that the criteria for choosing a solution should be simplicity, I just think that option 3 is the simplest. You don't have to worry about concurrency on any complicated (map) data structure. The critical section is really stupid, just append at end and poll from start. It's the same idea as Kafka 😂

EDIT: also my suggestion is not to bother with R/W locks - appending to a list is fast enough we can just hard lock

Contributor

I will leave this to @vpapavas .

The critical section is really stupid, just append at end and poll from start.

No comments. :)

Member Author

Tracked in issue #4277.

I feel the first solution might be the best because:

With TreeMap:
registerHeartbeat: lock, put into TreeMap, unlock (critical section is O(log N))
processHeartbeat: lock, copy subMap, clear the initial map, unlock (critical section is O(N))

With List:
registerHeartbeat: lock, add to List, unlock (critical section is O(1))
processHeartbeat: lock, sort, copy to a new list, clear the initial list, unlock (critical section is O(N log N))

Contributor

@agavra agavra Jan 11, 2020

@vpapavas three points:

  • With the second option, if you're okay with a double copy, the critical section becomes O(N): you just drain the list inside the critical section and sort it outside of it.
  • The Ns aren't necessarily the same: the N in the list approach is bounded by the number of heartbeats that arrive in one interval, while the N in the big TreeMap approach is bounded by all of the heartbeats you've seen.
  • There's also the frequency argument: hypothetically you're receiving heartbeats far more often than you're processing them (it's probably a write-heavy system).

But I digress, I'm not sure perf should be the driving factor here but rather ease of implementation. If you feel that the current implementation is simpler, then we should stick to that. It would be a fun exercise to write them both and compare.
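The "drain in the critical section, sort outside it" variant discussed above might look like the following. This is a sketch with invented names, not the code under review:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch: writers append under a plain lock (O(1) critical section); the
// single consumer swaps the list out (O(N) at most) and sorts the snapshot
// outside the lock (O(N log N), but uncontended).
public class DrainAndSortSketch {
    private final Object lock = new Object();
    private List<Long> heartbeats = new ArrayList<>();

    public void registerHeartbeat(long timestampMs) {
        synchronized (lock) {
            heartbeats.add(timestampMs);   // O(1) under the lock
        }
    }

    public List<Long> drainSorted() {
        List<Long> snapshot;
        synchronized (lock) {
            snapshot = heartbeats;         // swap the reference, no element copy
            heartbeats = new ArrayList<>();
        }
        Collections.sort(snapshot);        // sort outside the critical section
        return snapshot;
    }
}
```

Swapping the list reference instead of copying element-by-element keeps the consumer's critical section minimal even when many heartbeats accumulated.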

@agavra agavra self-assigned this Dec 19, 2019
Contributor

@vinothchandar vinothchandar left a comment

Left some comments. Maybe one more spin and we are home!

There are good comments here from @agavra. Let's file issues for these so we can track them for later, even if we don't do this right away?

Contributor

@vinothchandar vinothchandar left a comment

One real question and a bunch of cleanup suggestions. Great job @vpapavas! Otherwise LGTM.

final URI remoteUri = buildLocation(localURL, status.getHostInfoEntity().getHost(),
status.getHostInfoEntity().getPort());
serviceContext.getKsqlClient().makeHeartbeatRequest(remoteUri, localHostInfo,
System.currentTimeMillis());
Contributor

so the heartbeat timestamp is sent in the system's default time zone.. so the UTC issue above would be a real problem..

Member Author

Aaa good catch!

prev = ts;
}
// Check frame from last received heartbeat to window end
if (windowEnd - prev - 1 > 0) {
Contributor

Okay, fair.. I still have some gaps in understanding. The if-check here seems to check whether the difference between the last heartbeat (which can be windowStart or the real last heartbeat) and windowEnd is non-zero. I think this will mostly pass in the normal mode of operation, right? There will usually be some non-zero gap between the last heartbeat and windowEnd, as far as I can imagine.

The assignment below resets missedCount (presumably since we are interested in consecutive misses only). I am wondering if that's the right thing to do: could it be that we already accumulated a missedCount in the for loop above and then reset it to 0? Say heartbeatSendIntervalMs is 10 ms and heartbeats arrive perfectly; then windowEnd - prev is, say, 10 ms and we set missedCount = (10 - 1) / 10 = 0, even though we detected enough consecutive misses above?
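The concern above can be made concrete with a small sketch. This is a hypothetical reconstruction from the snippet, not the actual HeartbeatAgent code: if missedCount is reassigned rather than accumulated, misses detected inside the loop are discarded by the final window-end check.

```java
import java.util.List;

public class MissedCountSketch {
    // Reconstruction of the questioned logic: missedCount is overwritten
    // at each step, so only the *last* gap decides the final value.
    public static long lastGapOnly(List<Long> heartbeats, long windowStart,
                                   long windowEnd, long intervalMs) {
        long prev = windowStart;
        long missedCount = 0;
        for (long ts : heartbeats) {
            missedCount = (ts - prev - 1) / intervalMs;  // reset, not +=
            prev = ts;
        }
        // Check the frame from the last received heartbeat to the window end.
        if (windowEnd - prev - 1 > 0) {
            missedCount = (windowEnd - prev - 1) / intervalMs;
        }
        return missedCount;
    }
}
```

With intervalMs = 10, a window [0, 70], and heartbeats only at 50 and 60, the loop first computes 4 missed intervals, but the final assignment overwrites that with 0, so the earlier misses are lost.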

@vpapavas vpapavas force-pushed the heartbeat branch 2 times, most recently from c5ce119 to 86fe4c8 Compare January 14, 2020 04:01
Contributor

@agavra agavra left a comment

Mostly LGTM - thanks @vpapavas! This is a big step forward in HA :) my comments are all about the code (nothing structural) and a few bugs (I think) that I identified.

@@ -0,0 +1,496 @@
/*
* Copyright 2019 Confluent Inc.
Contributor

nit: 2020 🎉 (all of the new files)

Member Author

The files were created in 2019 :)

}

@Test(timeout = 60000)
public void shouldMarkRemoteServerAsUpThenDownThenUp() {
Contributor

We can do this in a follow-up PR, but I'm worried that these tests will add a non-trivial amount of time to our testing suite without necessarily adding much coverage. That is, if I'm understanding correctly, this is trying to cover the algorithm that computes missing intervals, which HeartbeatAgentTest already does. So while this test is great during development to make sure that things work, I don't know if we need it in our testing suite.

also any test that involves timing is prone to flakiness - can you please run this test at least 50 times and make sure it passes 100% of the time?

Member Author

These tests don't actually check the algorithm for missing heartbeats; rather, they exercise the whole heartbeating process. They use the /clusterStatus endpoint to check whether servers are correctly reported as alive or dead, testing the interplay of all three services.

when(query1.getAllMetadata()).thenReturn(allMetadata1);
when(streamsMetadata1.hostInfo()).thenReturn(remoteHostInfo);

DiscoverClusterService discoverService = heartbeatAgent.new DiscoverClusterService();
Contributor

woah I've never seen this syntax before 😂 that's kinda cool.

Comment on lines 216 to 218
if (shouldRetry(readTimeoutMs, e)) {
postAsync(path, jsonEntity, calcReadTimeout(readTimeoutMs));
}
Contributor

It's usually better to retry in a loop than in the exception block. Can you use a framework like Retryer or our RetryUtil to handle this?

Also, even if it eventually succeeds, won't it still throw because we fall through...?
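The loop-based retry shape suggested here might look like the following. This is a generic sketch with invented names; it does not show the Retryer/RetryUtil APIs or the actual postAsync code:

```java
import java.util.function.Supplier;

public class RetrySketch {
    // Retry in a loop instead of recursing from the catch block, so a
    // success on a later attempt returns normally rather than still
    // falling through to the failure path.
    public static <T> T withRetries(Supplier<T> attempt, int maxAttempts) {
        RuntimeException last = null;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                return attempt.get();   // success exits immediately
            } catch (RuntimeException e) {
                last = e;               // remember the failure and retry
            }
        }
        throw last;                     // only fail after all attempts
    }
}
```

The key difference from retrying inside the catch block is that control flow after a successful retry is unambiguous: the method returns, and nothing "falls through" to throw.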

Member Author

I followed what the prior code did here

fixed tests

use application_server config to determine local host address

fixed compile issues

added extra tests

test

fixed failing test

added debug logging, made critical section smaller

addressed almogs comments

added return
@vpapavas vpapavas changed the title feat: Initial implementation of heartbeat mechanism as part of KLIP-12 feat: Implementation of heartbeat mechanism as part of KLIP-12 Jan 16, 2020
@vpapavas vpapavas merged commit 37c1eaa into confluentinc:master Jan 16, 2020