Resolve race conditions #13

colmsnowplow · 2021-12-17T16:27:07Z

This PR is a follow-up to the review of #9, with some tweaks to those changes, as well as changes that address the causes of the test failures which remained at that stage.

Apologies that I couldn't make this PR easier to pick up and review from where we left the review of that PR. This is a complex one and I've done my best to make it as clear as possible. In that vein, here's what's in this PR:

All the major points from the last PR are kept

As a refresher, those issues are summarised in #3, with further specifics in #6 and #5.

Some tweaks to these have been made, nothing major.

Feedback from the previous PR is addressed

These are all just nits, e.g. renaming restartingConsumers to isRestartingConsumers.

New changes introduced

Tests added to isolate shard consumer behaviours
Shard consumer's consume() function changed to delay before exiting

Description here: #11.

In sum, duplicates were caused by this function exiting immediately, without leaving enough time for any records still in flight to commit their sequenceNumber. The next time the shard is consumed, those records are fetched again. This is solved by adding a commit loop before exit, which times out after maxAgeForClientRecord/2.

This may introduce latency but hopefully not much.

Clients changed to commit() to DynamoDB if they are healthy but there is no data

This one is a bit tricky, and is the change I'm least confident in (I do think it's fine, but would welcome scrutiny).

Summary:
The source of many of the issues we've uncovered is ownership of shards. When a client's shards don't have any new data for a period of time longer than clientRecordMaxAge(), kinsumer will treat the client as stale, and will stop and restart the consumers. This increases the chances of encountering an ownership problem.

Solution:
The tricky part is that we need to keep updating DDB when there's no data, but we need to stop updating DDB if there's some other issue. So, we keep a record of the timestamp every time a new record arrives at the shard consumer. If this timestamp is recent, we don't update the table, as we expect that record to checkpoint (and therefore trigger a natural commit to DDB). If it isn't recent, we do update DDB (since in this scenario the client is healthy; it just has no data).

.github/workflows/ci.yml

Makefile

leader.go

shard_consumer.go

colmsnowplow · 2022-01-26T11:18:24Z

@jbeemster I've addressed your comments: fixed clerical errors, used a mutex to lock unbecomeLeader() for thread safety, and added a mechanism to exit the shard consumer's consume() function as soon as we have checkpointed the latest sequence number that we passed to the user.

On the last one, note that cp.dirty signifies that the sequenceNumber in the checkpointer has not yet been committed to the DB, so we check both that this is false and that it's the last sequenceNumber.
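As a rough illustration of that exit condition, here is a minimal sketch. Only the meaning of cp.dirty comes from the comment above; the checkpointer type shown here, its update, commit and caughtUp methods, and the in-memory commit are assumptions made for the example, not kinsumer's implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// checkpointer is a simplified, hypothetical stand-in for the real one.
type checkpointer struct {
	mu             sync.Mutex
	sequenceNumber string
	dirty          bool // true while sequenceNumber is not yet committed to the DB
}

// update records a newly checkpointed sequence number that still needs committing.
func (cp *checkpointer) update(seq string) {
	cp.mu.Lock()
	defer cp.mu.Unlock()
	cp.sequenceNumber = seq
	cp.dirty = true
}

// commit stands in for the write to the DB; here it only clears the flag.
func (cp *checkpointer) commit() {
	cp.mu.Lock()
	defer cp.mu.Unlock()
	cp.dirty = false
}

// caughtUp reports whether it is safe to exit early: the checkpointer holds
// the last sequence number passed to the user AND it has been committed.
func (cp *checkpointer) caughtUp(lastSeq string) bool {
	cp.mu.Lock()
	defer cp.mu.Unlock()
	return !cp.dirty && cp.sequenceNumber == lastSeq
}

func main() {
	cp := &checkpointer{}
	lastPassedToUser := "seq-2"

	cp.update("seq-1")
	cp.commit()
	fmt.Println(cp.caughtUp(lastPassedToUser)) // committed, but not the last record

	cp.update("seq-2")
	fmt.Println(cp.caughtUp(lastPassedToUser)) // last record, but still dirty

	cp.commit()
	fmt.Println(cp.caughtUp(lastPassedToUser)) // last record and committed
}
```

Checking only one of the two conditions would be unsafe: a clean checkpointer may still be behind the last record, and a checkpointer holding the last record may not have written it yet.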

For your sanity I'll hold off on rebasing anything, and leave comments unresolved, so that it's possible to keep track of the feedback so far.

If I can help navigate/refresh memories again just shout, happy to walk through this as many times as necessary to build confidence (same goes for you @paulboocock).

paulboocock

I have nothing extra to add here. LGTM.

colmsnowplow requested review from jbeemster and paulboocock December 17, 2021 16:27

colmsnowplow mentioned this pull request Jan 17, 2022

Race conditions #9

Closed

jbeemster reviewed Jan 17, 2022

.github/workflows/ci.yml (outdated, resolved)

jbeemster reviewed Jan 17, 2022

Makefile (outdated, resolved)

jbeemster reviewed Jan 17, 2022

leader.go (resolved)

jbeemster reviewed Jan 17, 2022

shard_consumer.go (resolved)

jbeemster approved these changes Jan 26, 2022

paulboocock approved these changes Jan 28, 2022

colmsnowplow added 9 commits February 1, 2022 12:42

Fix integration tests and run on GH actions

57ca902

#2

Add configuration of clientRecordMaxAge

9e3d391

#7

Avoid race condition in unbecomeLeader()

c6dac83

#8

Gracefully handle ownership clash errors

3952ff4

#5

Modify getClients() and shard refresh behaviour to avoid ownership cl…

e4b71f3

…ashes

run go mod tidy to fix errors

828d2bf

Add unit tests for shard consumer issues

64c7828

#10

Add fix for duplicates caused by consume exiting too early

a93817e

#11

Update checkpointer to commit periodically when there's no data

4cd0340

#12

colmsnowplow force-pushed the resolve-race-conditions branch from 861e45a to 4cd0340 Compare February 1, 2022 12:55

colmsnowplow merged commit 4e19415 into master Feb 1, 2022

colmsnowplow deleted the resolve-race-conditions branch February 1, 2022 15:32

colmsnowplow mentioned this pull request Apr 26, 2022

Update kinsumer fork to 1.3.0 snowplow/snowbridge#73

Closed

This was referenced Jul 1, 2022

Resolve race conditions in shard ownership #3

Closed

Migrate from travis to GH actions #4

Closed

Update integration testing setup #2

Closed

enrich-kinesis: Recover from losing lease to a new worker snowplow/enrich#649

Closed
