-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Resolve race conditions #13
Conversation
@jbeemster I've addressed your comments - fixed clerical errors, used a mutex to lock unbecomeLeader() for threadsafety, added a mechanism to exit the shard consumer's On the last one note that cp.dirty signifies that the sequenceNumber in the checkpointer has not yet been committed to the DB, so we check both that this is false and that it's the last seqenceNumber. For your sanity I'll hold off on rebasing anything, and leave comments unresolved, so that it's possible to keep track of the feedback so far. If I can help navigate/refresh memories again just shout, happy to walk through this as many times as necessary to build confidence (same goes for you @paulboocock ). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have nothing extra to add here. LGTM.
861e45a
to
4cd0340
Compare
This PR is a follow up to the review of #9, with some tweaks to those changes, as well as changes to address the causes of test failures which remained at that stage.
Apologies that I couldn't make this PR easier to pick up and review from where we left the review of that PR. This is a complex one and I've done my best to make it as clear as possible. In that vein, here's what's in this PR:
All the major points from the last PR are kept
As a refresher, those issues are summarised in #3, with further specifics in #6, and #5.
Some tweaks to these have been made, nothing major.
Feedback from the previous PR is addressed
These are all just nits, eg. renaming
restartingConsumers
toisRestartingConsumers
.New changes introduced
Tests added to isolate shard consumer behaviours
Shard consumer's
consume()
function changed to delay before exitingDescription here: #11.
In sum, duplicates were caused by this function exiting immediately, without enough time for any existing records committing their sequenceNumber. The next time the shard is consumed, these records are grabbed again. Solved by adding a commit loop before exit, which times out after maxAgeForClientRecord/2.
This may introduce latency but hopefully not much.
commit()
to dynamo DB if they are healthy, but there is no dataThis one is a bit tricky, and is the change I'm least confident in (I do think it's fine, but would welcome scrutiny)
Summary:
The source of many of the issues we've uncovered is ownership of shards. When those shards don't have any new data for a period of time longer than
clientRecordMaxAge()
, kinsumer will treat the client as stale, and will stop and restart the consumers. This increases the chances of encountering some ownership problem.Solution:
The tricky part is that we need to keep updating DDB when there's no data, but we need to stop updating DDB if there's some other issue. So, we keep record of the timestamp every time a new record arrives to shard consumer. If this timestamp is recent, we don't update the table, as we expect that record to checkpoint (and therefore trigger a natural commit to DDB). If it isn't recent, we do update DDB (since in this scenario the client is healthy it just has no data).