Currently, if a new instance of enrich-kinesis joins the autoscaling group, we get exceptions like this in the logs:
software.amazon.kinesis.exceptions.ShutdownException: Can't update checkpoint - instance doesn't hold the lease for this shard
It is easy to recreate the exception:
1. Start enrich-kinesis with 1 worker and 2 input shards. The worker processes both shards.
2. Generate a constant stream of incoming events, to keep the worker busy.
3. Scale out the autoscaling group to add a second worker.
The second worker steals one lease from the original worker. The original worker then fails to checkpoint the shard it no longer owns, so it crashes and restarts.
Things get worse and worse: when a worker crashes, the other worker steals its lease. When the crashed worker restarts, it steals a lease back again, causing more exceptions in the worker that loses the lease. We end up in an irrecoverable spin where neither worker can stay alive.
By the way, this has nothing to do with the KCL setting for failover time. This is not about workers failing to check in; it's just the regular lease stealing mechanism.
The solution
We need to catch and handle exceptions here when checkpointing. A sensible exception to catch is KinesisClientLibNonRetryableException: as the name suggests, there is no point retrying when it happens. We should just ignore it and continue.
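As a rough sketch (the helper name below is illustrative, not the actual enrich-kinesis code), the checkpoint call could be wrapped like this, relying on the KCL 2.x exception hierarchy where ShutdownException extends KinesisClientLibNonRetryableException:

```scala
import software.amazon.kinesis.exceptions.KinesisClientLibNonRetryableException
import software.amazon.kinesis.processor.RecordProcessorCheckpointer

object SafeCheckpoint {

  // Hypothetical helper: checkpoint the shard, but swallow non-retryable
  // failures such as the ShutdownException thrown when another worker has
  // stolen the lease for this shard.
  def safeCheckpoint(checkpointer: RecordProcessorCheckpointer): Unit =
    try checkpointer.checkpoint()
    catch {
      case e: KinesisClientLibNonRetryableException =>
        // Retrying cannot succeed: we no longer hold the lease. Log and move
        // on; the worker that now owns the lease will checkpoint this shard.
        System.err.println(s"Ignoring checkpoint failure on a lost lease: ${e.getMessage}")
    }
}
```

In the real code we would presumably log through the application's logger rather than stderr, and might still let retryable exceptions (e.g. throttling) propagate or be retried as before; only the non-retryable case is safe to drop.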