Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolve race conditions #13

Merged
merged 9 commits into from
Feb 1, 2022
Merged

Resolve race conditions #13

merged 9 commits into from
Feb 1, 2022

Conversation

colmsnowplow
Copy link

@colmsnowplow colmsnowplow commented Dec 17, 2021

This PR is a follow up to the review of #9, with some tweaks to those changes, as well as changes to address the causes of test failures which remained at that stage.

Apologies that I couldn't make this PR easier to pick up and review from where we left the review of that PR. This is a complex one and I've done my best to make it as clear as possible. In that vein, here's what's in this PR:


All the major points from the last PR are kept

As a refresher, those issues are summarised in #3, with further specifics in #6, and #5.

Some tweaks to these have been made, nothing major.


Feedback from the previous PR is addressed

These are all just nits, eg. renaming restartingConsumers to isRestartingConsumers.


New changes introduced

  • Tests added to isolate shard consumer behaviours

  • Shard consumer's consume() function changed to delay before exiting

Description here: #11.

In sum, duplicates were caused by this function exiting immediately, without enough time for any existing records committing their sequenceNumber. The next time the shard is consumed, these records are grabbed again. Solved by adding a commit loop before exit, which times out after maxAgeForClientRecord/2.

This may introduce latency but hopefully not much.

  • Clients changed to register commit() to dynamo DB if they are healthy, but there is no data

This one is a bit tricky, and is the change I'm least confident in (I do think it's fine, but would welcome scrutiny)

Summary:
The source of many of the issues we've uncovered is ownership of shards. When those shards don't have any new data for a period of time longer than clientRecordMaxAge(), kinsumer will treat the client as stale, and will stop and restart the consumers. This increases the chances of encountering some ownership problem.

Solution:
The tricky part is that we need to keep updating DDB when there's no data, but we need to stop updating DDB if there's some other issue. So, we keep record of the timestamp every time a new record arrives to shard consumer. If this timestamp is recent, we don't update the table, as we expect that record to checkpoint (and therefore trigger a natural commit to DDB). If it isn't recent, we do update DDB (since in this scenario the client is healthy it just has no data).

.github/workflows/ci.yml Outdated Show resolved Hide resolved
Makefile Outdated Show resolved Hide resolved
leader.go Show resolved Hide resolved
@colmsnowplow
Copy link
Author

colmsnowplow commented Jan 26, 2022

@jbeemster I've addressed your comments - fixed clerical errors, used a mutex to lock unbecomeLeader() for threadsafety, added a mechanism to exit the shard consumer's consume() function as soon as we have checkpointed the latest sequence number that we passed to the user.

On the last one note that cp.dirty signifies that the sequenceNumber in the checkpointer has not yet been committed to the DB, so we check both that this is false and that it's the last seqenceNumber.

For your sanity I'll hold off on rebasing anything, and leave comments unresolved, so that it's possible to keep track of the feedback so far.

If I can help navigate/refresh memories again just shout, happy to walk through this as many times as necessary to build confidence (same goes for you @paulboocock ).

Copy link

@paulboocock paulboocock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have nothing extra to add here. LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants