Currently, if a new instance of enrich-kinesis joins the autoscaling group, we get exceptions like this in the logs:
software.amazon.kinesis.exceptions.ShutdownException: Can't update checkpoint - instance doesn't hold the lease for this shard
It is easy to recreate the exception:
1. Start enrich-kinesis with 1 worker and 2 input shards. The worker processes both shards.
2. Generate a constant stream of incoming events, to keep the worker busy.
3. Scale out the autoscaling group to add a second worker.
The second worker steals one lease from the original worker. The original worker then fails to checkpoint the shard it no longer owns, so it crashes and restarts.
Things get worse and worse: when a worker crashes, the other worker steals its lease. When the crashed worker restarts, it steals a lease back again, causing more exceptions in the worker that loses the lease. We end up in an irrecoverable spin where neither worker can stay alive.
By the way, this has nothing to do with the KCL setting for failover time. This is not about workers failing to check in; it's just the regular lease stealing mechanism.
The solution
We need to catch and handle exceptions here when checkpointing. A sensible exception to catch is KinesisClientLibNonRetryableException: as the name suggests, there is no point retrying when it happens. We should just ignore it and continue.
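As a rough sketch (the helper name below is illustrative, not the actual enrich-kinesis code), the checkpoint call could be wrapped like this, relying on the KCL 2.x exception hierarchy where ShutdownException extends KinesisClientLibNonRetryableException:

```scala
import software.amazon.kinesis.exceptions.KinesisClientLibNonRetryableException
import software.amazon.kinesis.processor.RecordProcessorCheckpointer

object SafeCheckpoint {

  // Hypothetical helper: checkpoint the shard, but swallow non-retryable
  // failures such as the ShutdownException thrown when another worker has
  // stolen the lease for this shard.
  def safeCheckpoint(checkpointer: RecordProcessorCheckpointer): Unit =
    try checkpointer.checkpoint()
    catch {
      case e: KinesisClientLibNonRetryableException =>
        // Retrying cannot succeed: we no longer hold the lease. Log and move
        // on; the worker that now owns the lease will checkpoint this shard.
        System.err.println(s"Ignoring checkpoint failure on a lost lease: ${e.getMessage}")
    }
}
```

In the real code we would presumably log through the application's logger rather than stderr, and might still let retryable exceptions (e.g. throttling) propagate or be retried as before; only the non-retryable case is safe to drop.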