enrich-kinesis: Recover from losing lease to a new worker #649

Closed
istreeter opened this issue Jul 1, 2022 · 2 comments
Labels
bug Something isn't working

Comments

@istreeter
Contributor

Currently, if a new instance of enrich-kinesis joins the autoscaling group, then we get exceptions in the logs like this:

software.amazon.kinesis.exceptions.ShutdownException: Can't update checkpoint - instance doesn't hold the lease for this shard

It is easy to recreate the exception:

  • Start enrich-kinesis with 1 worker and 2 input shards. The worker will process both shards.
  • Generate a constant stream of incoming events, to keep the worker busy.
  • Scale out the autoscaling group to add a second worker.

The second worker steals 1 lease from the original worker. The original worker then fails to checkpoint because it doesn't own one of the leases. The worker crashes and restarts.

Things get worse and worse: when a worker crashes, the other worker steals its lease. When the crashed worker restarts, it steals a lease back again, causing more exceptions in the worker that loses the lease. We end up in an irrecoverable spin where neither worker can stay alive.

By the way, this has nothing to do with the KCL setting for failover time. This is not about workers failing to check in; it's just the regular lease stealing mechanism.

The solution

We need to catch and handle exceptions at the point where we checkpoint. A sensible exception to catch is KinesisClientLibNonRetryableException. As the name suggests, there is no point retrying when this exception is thrown; we should just ignore it and continue.
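A minimal sketch of what this could look like, assuming KCL 2.x and SLF4J on the classpath (`SafeCheckpoint` is a hypothetical helper for illustration, not the actual enrich-kinesis code):

```scala
import org.slf4j.LoggerFactory
import software.amazon.kinesis.exceptions.KinesisClientLibNonRetryableException
import software.amazon.kinesis.processor.RecordProcessorCheckpointer

object SafeCheckpoint {
  private val logger = LoggerFactory.getLogger(getClass)

  // Attempt to checkpoint; if the failure is non-retryable (e.g. the lease
  // has been stolen by another worker), log a warning and carry on rather
  // than let the exception crash the whole worker.
  def apply(checkpointer: RecordProcessorCheckpointer): Unit =
    try checkpointer.checkpoint()
    catch {
      case e: KinesisClientLibNonRetryableException =>
        logger.warn(s"Could not checkpoint, ignoring and continuing: ${e.getMessage}")
    }
}
```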

@colmsnowplow
Contributor

We recently resolved a very similar issue in our fork of the golang kinesis consumer we use in stream-replicator.

The mechanisms of the two may be different, but I'll be happy to discuss the various problems and solutions I encountered if that would be helpful. :)

@istreeter
Contributor Author

A sensible exception to catch is KinesisClientLibNonRetryableException

I've changed my mind about this -- I think it's better to catch the more specific ShutdownException because that's the only one we are expecting.
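Again only a sketch under the same assumptions (KCL 2.x; the function name is made up), narrowing the catch so that only the expected exception is swallowed and anything else still propagates:

```scala
import software.amazon.kinesis.exceptions.ShutdownException
import software.amazon.kinesis.processor.RecordProcessorCheckpointer

// Only swallow ShutdownException, i.e. the lease for this shard now belongs
// to another worker. Any other failure still propagates as before.
def checkpointIgnoringLostLease(checkpointer: RecordProcessorCheckpointer): Unit =
  try checkpointer.checkpoint()
  catch {
    case _: ShutdownException => () // lease lost; skip this checkpoint
  }
```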
