
Data Pipeline Important Considerations

Jinlian(Sunny) Wang edited this page Sep 20, 2021 · 6 revisions
  • Record ordering

Kafka and Kinesis only guarantee ordering within a partition. Set the partition key to a data attribute that makes sense for this application, such as a customer SSO ID or customer reference ID.
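A minimal sketch of why the partition key matters: records with the same key always hash to the same partition, so per-key ordering is preserved. The hash below is purely illustrative (Kafka's default producer actually uses murmur2), and `NUM_PARTITIONS` is a made-up value.

```python
# Illustrative only: same key -> same partition -> per-key ordering preserved.
import hashlib

NUM_PARTITIONS = 4  # hypothetical partition count

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Hash the partition key deterministically, then map into a partition.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Two events for the same customer land in the same partition,
# so their relative order is maintained by the broker.
p1 = partition_for("customer-sso-123")
p2 = partition_for("customer-sso-123")
assert p1 == p2
```

Keying on a random or ever-changing attribute would spread one customer's events across partitions and lose their ordering.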

  • Error handling/DLQ

Set up a DLQ if support for it is built in, as with SQS. Otherwise, catch the exception and write the failed record to another queue or stream, such as a separate Kafka topic. See more: https://stackoverflow.com/questions/32501985/amazon-kinesis-aws-lambda-retries and "Surviving poison messages in MSMQ".
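A minimal sketch of the manual approach when DLQ support is not built in: exceptions are caught per record and the failure is routed to a secondary sink, which here is a plain list standing in for another Kafka topic or queue. `process` and the record shapes are hypothetical.

```python
# Sketch: route records that fail processing to a manual dead-letter sink.
def process(record: dict) -> None:
    # Hypothetical business logic; rejects malformed records.
    if "payload" not in record:
        raise ValueError("malformed record")

def consume(records: list, dlq: list) -> int:
    """Process each record; failed ones go to the DLQ instead of crashing the batch."""
    handled = 0
    for record in records:
        try:
            process(record)
            handled += 1
        except Exception as exc:
            # Attach the failure reason so whoever drains the DLQ can triage.
            dlq.append({"record": record, "error": str(exc)})
    return handled

dead_letters: list = []
ok = consume([{"payload": 1}, {"bad": True}], dead_letters)
```

Catching per record (rather than per batch) keeps one poison message from forcing endless retries of the whole batch, which is the failure mode the Kinesis/Lambda link above describes.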

  • Duplicates/deduplication/idempotent

Try your best to make the pipeline idempotent. If that is not possible, give each record a unique identifier, then query a DB before processing to make sure the record is not handled twice.
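A minimal sketch of the dedup check, assuming each record carries a unique `id`. An in-memory set stands in for the DB lookup mentioned above (in practice this would be a table keyed on the record identifier, e.g. DynamoDB).

```python
# Sketch: skip records whose unique id has already been processed.
processed_ids: set = set()  # stands in for a durable dedup table

def handle_once(record: dict) -> bool:
    """Return True if the record was processed, False if it was a duplicate."""
    rid = record["id"]
    if rid in processed_ids:
        return False  # duplicate delivery; skip
    processed_ids.add(rid)
    # ... actual (non-idempotent) processing would go here ...
    return True
```

Because Kafka and Kinesis provide at-least-once delivery by default, the same record can arrive twice; the check makes the second delivery a no-op.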

  • Offset Keeping/Checkpointing

Kafka keeps the committed offset for each consumer group, so it knows where to resume delivery when some consumers in a group crash, or when all of them do.
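A minimal sketch of what the broker tracks, with a dict standing in for Kafka's committed-offset store. The group and offset values are hypothetical.

```python
# Sketch: per-(group, partition) committed offsets let a restarted
# consumer resume where its group left off instead of reprocessing.
committed: dict = {}  # (group, partition) -> next offset to read

def commit(group: str, partition: int, offset: int) -> None:
    committed[(group, partition)] = offset

def resume_position(group: str, partition: int) -> int:
    # A brand-new group has no committed offset and starts at the beginning.
    return committed.get((group, partition), 0)

commit("billing-group", 0, 42)
# After a crash, the restarted consumer asks the broker where to resume.
pos = resume_position("billing-group", 0)  # -> 42
```

The same mechanism covers partial failures: when one consumer in the group dies, its partitions are reassigned and the new owner resumes from the group's last committed offset.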

  • Cross region auto failover/Resiliency

Each pipeline has two states: "0" (handle both regions' traffic) and "1" (handle only its own region's traffic). The pipeline reads this flag from an S3 bucket in its own region. A Lambda monitors metrics from the other region; when an alarm fires, it sets the local pipeline's flag to "0" so it takes over both regions, and when the other region is healthy it resets the flag to "1".
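A minimal sketch of the failover decision described above. The S3 read/write and the cross-region alarm check are stubbed with plain values; in a real Lambda those would be boto3 calls, and the flag key name here is hypothetical.

```python
# Sketch: the monitoring Lambda's decision logic for the failover flag.
def desired_flag(other_region_alarmed: bool) -> str:
    # "0": this region's pipeline handles BOTH regions' traffic.
    # "1": this region's pipeline handles only its own traffic.
    return "0" if other_region_alarmed else "1"

def update_flag(flag_store: dict, other_region_alarmed: bool) -> None:
    # flag_store stands in for the flag object in this region's S3 bucket.
    flag_store["pipeline_flag"] = desired_flag(other_region_alarmed)

store: dict = {}
update_flag(store, other_region_alarmed=True)   # other region is down -> "0"
update_flag(store, other_region_alarmed=False)  # other region recovered -> "1"
```

Keeping the flag in the same region as the pipeline that reads it means the failover signal stays readable even when the other region is completely unavailable.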
