kafka-in-production

Curious how big companies operate their Kafka fleets in production? This might be the repo for you:

What issues come up when running Kafka in production? 📝
How do other organisations attempt to solve these issues? 🛠️
Why are certain approaches adopted over others? ⚖️
What can we learn for our own use case?

Adobe
Agoda
Airbnb
Allegro
Apple
AppsFlyer
BigCommerce
Bitpanda
Bloomberg
Bolt
Booking.com
Brex
CERN
Cloudflare
Cloudera
Coinbase
Criteo
Crowdstrike
Datadog
Decathlon
Deliveroo
DoorDash
GoTo
Grab
HelloFresh
Honeycomb
Hubspot
Indeed
Klarna
LinkedIn
Lyft
Michelin
Monzo
Morgan Stanley
Netflix
New Relic
PayPal
Pinterest
Platformatory
Riskified
Robinhood
Shopify
Slack
Stripe
Uber
Wise
Wix
Yelp
Zalando
Zendesk
Zopa Bank

Adobe

How Adobe Experience Platform Pipeline Became the Cornerstone of In-Flight Processing for Adobe - 2019 - 📚
Moving Beyond Newtonian Reductionism in the Management of Large-Scale Distributed Systems, Part 2 - 2019 - 📚
Adobe Experience Platform’s Streaming Sources and Destinations Overview and Architecture - 2019 - 📚
Wins from Effective Kafka Monitoring at Adobe: Stability, Performance, and Cost Savings - 2019 - 📚
Creating Adobe Experience Platform Pipeline with Kafka - 2018 - 📚

Agoda

How We Solve Load Balancing Challenges in Apache Kafka - 2024 - 📚
How Agoda manages 1.5 Trillion Events per day on Kafka - 2021 - 📚
Adding Time Lag to Monitor Kafka Consumer - 2021 - 📚
How our data scientists' petabytes of data is ingested into Hadoop (from Kafka) - 2021 - 📚

Airbnb

Migrating Kafka transparently between Zookeeper clusters - 2021 - 📚

Allegro

Unlocking Kafka's Potential: Tackling Tail Latency with eBPF - 2024 - 📚

Apple

Leveraging Tiered Storage in Strimzi-Operated Kafka for Cost-Effective Streaming Applications - 2024 - 🎙️
Balance Kafka Cluster with Zero Data Movement - 2023 - 🎙️
Experiences Operating Apache Kafka® at Scale - 2019 - 🎙️
Kafka as a Service A Tale of Security and Multi Tenancy - 2018 - 🎙️

AppsFlyer

Four Crucial Steps to Take Before Changing Kafka Partition Key at Scale - 2023 - 📚
Kafka Lag Monitoring For Human Beings - 2020 - 🎙️
Apache Kafka Lag Monitoring at AppsFlyer - 2020 - 📚
Managing your Kafka in an explosive growth environment - 2019 - 🎙️

BigCommerce

Scalable E-Commerce Data Pipelines with Kafka: Real-Time Analytics, Batch, ML, Data Lake, and Beyond - 2023 - 🎙️

Bitpanda

Bitpanda's new trade engine - Part #1 - asynchronous trading leveraging Kafka - 2023 - 📚

Bloomberg

Fully-Managed, Multi-Tenant Kafka Clusters: Tips, Tricks, and Tools - 2022 - 🎙️

Bolt

Using Apache Kafka and ksqlDB for Data Replication at Bolt - 2021 - 🎙️
How Bolt Has Adopted Change Data Capture with Confluent Platform - 2020 - 📚
Kewei Shang - 2020 - 📚

Booking.com

Data Streaming Ecosystem Management at Booking.com - 2018 - 📚

Brex

Transactional Events Publishing At Brex - 2022 - 📚

CERN

CERN IoT Kafka Pipelines - 2024 - 🎙️

Cloudflare

All about Kafka - 2024 - 🎙️
Intelligent, automatic restarts for unhealthy Kafka consumers - 2023 - 📚
Using Apache Kafka to process 1 trillion inter-service messages - 2022 - 📚

Cloudera

Using Streams Replication Manager Prefixless Replication for Kafka Topic Aggregation - 2024 - 📚
Streams Replication Manager Prefixless Replication - 2024 - 📚

Coinbase

Kafka infrastructure renovation at Coinbase - 2022 - 📚
How we scaled data streaming at Coinbase using AWS MSK - 2021 - 📚

Criteo

Managing Kafka and Data Streams at Criteo - 2023 - 📚
Upgrading Kafka on a large infra, or: when moving at scale requires careful planning - 2019 - 📚
How Criteo is managing one of the largest Kafka Infrastructure in Europe - 2019 - 📚

Crowdstrike

Real-time Adaptive Controls for Kafka Consumers - 2024 - 🎙️

Datadog

Running Production Kafka Clusters in Kubernetes - 2019 - 🎙️

Decathlon

Seamless data exchange with Kafka Connect and Strimzi on Kubernetes at Decathlon - 2024 - 📚

Deliveroo

Improving Stream Data Quality With Protobuf Schema Validation - 2019 - 📚

DoorDash

DoorDash Empowers Engineers with Kafka Self-Serve - 2024 - 📚
API-First Approach to Kafka Topic Creation - 2023 - 📚
Building Scalable Real Time Event Processing with Kafka and Flink - 2020 - 📚
Eliminating Task Processing Outages by Replacing RabbitMQ with Apache Kafka Without Downtime - 2020 - 📚

GoTo

Sink Kafka Messages to ClickHouse Using 'ClickHouse Kafka Ingestor' - 2022 - 📚
When Kafka Went Offshore - 2021 - 📚
Enhancing Ziggurat - The Backbone Of Gojek's Kafka Ecosystem - 2021 - 📚
Handling Dead Letters in a Streaming System - 2020 - 📚
How Kafka Solved a Culture Problem at Gojek - 2019 - 📚
Fronting : An Armoured Car for Kafka Ingestion - 2018 - 📚
Sakaar: Taking Kafka data to cloud storage at GO-JEK - 2018 - 📚

Grab

Kafka on Kubernetes: Reloaded for fault tolerance - 2023 - 📚
Zero trust with Kafka - 2022 - 📚
How Kafka Connect helps move data seamlessly - 2022 - 📚
Exposing a Kafka Cluster via a VPC Endpoint Service - 2022 - 📚
Detect Fraud Successfully with GrabDefence! - 2021 - 🎙️
Optimally Scaling Kafka Consumer Applications - 2020 - 📚

HelloFresh

ProtoMock: Simple Kafka Testing by Generating Mock Data from Protobuf Schemas - 2023 - 📚
Renaming a Kafka topic - 2023 - 📚

Honeycomb

Scaling Telemetry Systems with Streaming - 2023 - 🎙️
Lessons Learned From the Migration to Confluent Kafka - 2021 - 📚
Scaling Kafka at Honeycomb - 2021 - 📚
Bitten by a Kafka Bug - Postmortem - 2019 - 📚

Hubspot

Our Journey to Multi-Region: Supporting Cross-Region Kafka Messaging - 2022 - 📚

Indeed

Indeed Flex: The Story of a Revolutionary Recruitment Platform - 2023 - 🎙️

Klarna

Evolving a Real-time Fraud Barrier with Kafka - 2024 - 🎙️

LinkedIn

Load-balanced Brooklin Mirror Maker: Replicating large-scale Kafka clusters at LinkedIn - 2022 - 📚
TopicGC: How LinkedIn cleans up unused metadata for its Kafka clusters - 2022 - 📚
How LinkedIn customizes Apache Kafka for 7 trillion messages per day - 2019 - 📚
URP? Excuse You! The Three Metrics You Have to Know - 2018 - 🎙️
Test Strategy for Samza/Kafka Services - 2017 - 📚
Kafka Ecosystem at LinkedIn - 2016 - 📚
Kafkaesque Days at LinkedIn – Part 1 - 2016 - 📚
How We’re Improving and Advancing Kafka at LinkedIn - 2015 - 📚

Lyft

Evolution of Streaming Pipeline at Lyft - 2023 - 🎙️
Building an Adaptive, Multi-Tenant Stream Bus with Kafka and Golang - 2020 - 📚
Can Kafka Handle a Lyft Ride? - 2020 - 🎙️
Operating Apache Kafka Clusters 24/7 Without A Global Ops Team - 2019 - 📚
Bulletproof Apache Kafka® with Fault Tree Analysis - 2019 - 🎙️
Production Ready Kafka on Kubernetes - 2019 - 🎙️

Monzo

How we built a queue on top of Kafka - 2024 - 📚

Michelin

Designing Kafka Streams Applications - 2024 - 📚
Contributing to open source software : AKHQ - 2024 - 📚
How to 'Kstreamplify' : your new way to develop Kafka Streams application - 2023 - 📚
From Monolithic Orchestrator to Streaming with Microservices - 2023 - 🎙️
Migrate Applications from Kafka On-Premise to Confluent Cloud - 2022 - 📚
The Michelin Guide: an unexpected event driven use case - 2022 - 📚
Moving from orchestration to choregraphy - Part 3 - 2022 - 📚
Moving from orchestration to choregraphy - Part 2 - 2021 - 📚
Moving from orchestration to choregraphy - Part 1 - 2021 - 📚
“The metamorphose” of our Information System by Implementing a distributed event streaming platform - 2021 - 📚

Morgan Stanley

Consistent, High-throughput, Real-time Calculation Engines Using Kafka Streams - 2023 - 🎙️

Netflix

Self-Hosting Kafka at Scale: Netflix's Journey and Challenges - 2024 - 🎙️
Featuring Apache Kafka in the Netflix Studio and Finance World - 2020 - 📚
Inca — Message Tracing and Loss Detection For Streaming Data @Netflix - 2019 - 📚
Evolution of the Netflix Data Pipeline - 2016 - 📚
Kafka Inside Keystone Pipeline - 2016 - 📚

New Relic

Scaling Data Ingestion: Overcoming Challenges with Cell Architecture - 2024 - 🎙️
Keep Your Kafka Cloud Costs in Check with Showbacks - 2024 - 🎙️
Tuning Apache Kafka Consumers to maximize throughput and reduce costs - 2024 - 📚
20 best practices for Apache Kafka at scale - 2018 - 📚
Using Apache Kafka for Real-Time Event Processing at New Relic - 2018 - 📚
Best practices and strategies for Kafka topic partitioning - 2021 - 📚
AWS re:Invent 2020: How New Relic is migrating its Apache Kafka cluster to Amazon MSK - 2021 - 🎙️
New Relic case: "Huge scale, small clusters: Using Cells to scale in the Cloud" - 2021 - 🎙️
Monitoring Kafka without instrumentation using eBPF - 2022 - 🎙️
Key Metrics To Uncover the Root Cause of Kafka Performance Anomalies - 2022 - 🎙️
Reducing Impact of Single Broker Failures in Kafka - 2023 - 🎙️
Go Big or Go Home: Approaching Kafka Replication at Scale - 2023 - 🎙️
Mitigating Kafka Broker ‘Gray’ Failures For Key Based Partitioners With Partition Multihoming - 2023 - 🎙️
Monitoring Apache Kafka for cloud cost reduction - 2023 - 📚

PayPal

Scaling Kafka to Support PayPal’s Data Growth - 2023 - 📚
Scaling Kafka Consumer for Billions of Events - 2021 - 📚
Marching Toward a Trillion Kafka Messages per Day: Running Kafka at scale at PayPal - 2020 - 🎙️

Pinterest

Pinterest Tiered Storage for Apache Kafka®️: A Broker-Decoupled Approach - 2024 - 📚
Pinterest’s Journey to a Automated, Efficient, and Low-Maintenance Kafka Platform - 2024 - 🎙️
Lessons Learned from Running Apache Kafka at Scale at Pinterest - 2021 - 📚
How Pinterest runs Kafka at scale - 2018 - 📚
Open sourcing DoctorKafka: Kafka cluster healing and workload balancing - 2017 - 📚

Platformatory

Kafka Latency Analyzer: Get Insights into Per-record, End-to-end Latency - 2023 - 🎙️

Riskified

How to Manage Schemas and Handle Standardization - 2023 - 📚
How to Roll Your Kafka Cluster With Zero Downtime and No Data Loss - 2023 - 📚
Know Your Limits: Cluster Benchmarks - 2022 - 📚
Let’s Make Your CFO Happy; A Practical Guide for Kafka Cost Reduction - 2022 - 🎙️
From AWS CloudFormation to Terraform: Migrating Apache Kafka - 2021 - 📚

Robinhood

Robinhood’s Kafka Journey from EC2 to Kubernetes - 2024 - 🎙️
Robinhood’s Kafkaproxy: Decoupling Kafka Consumer Logic from Application Business Logic - 2023 - 🎙️
Tackling Kafka, with a Small Team - 2019 - 🎙️

Shopify

Resilient Kafka: How DNS Traffic Management and Client Wrappers Ensure Availability - 2023 - 🎙️
Capturing Every Change From Shopify’s Sharded Monolith - 2021 - 📚
Running Apache Kafka on Kubernetes at Shopify - 2018 - 📚
Kafka Producer Pipeline for Ruby on Rails - 2014 - 📚

Slack

Building Self-driving Kafka clusters using open source components - 2022 - 📚

Stripe

Mastering Kafka at Scale: Unleashing the Power of Temporal at Stripe | Replay 2023 - 2023 - 🎙️
6 Nines: How Stripe keeps Kafka highly-available across the globe - 2022 - 🎙️

Uber

Protobuf Support in Uber's Real-Time Data Stack - 2024 - 🎙️
Topic Federation: Enhance Kafka Availabilty with Sharded Topics Across Clusters - 2024 - 🎙️
Introduction to Kafka Tiered Storage at Uber - 2024 - 📚
Exactly-Once Stream Processing at Scale in Uber - 2024 - 🎙️
Learnings of Running Kafka Tiered Storage at Scale - 2023 - 🎙️
Securing Kafka® Infrastructure at Uber - 2022 - 📚
Real-Time Exactly-Once Ad Event Processing with Apache Flink, Kafka, and Pinot - 2021 - 📚
Introducing uGroup: Uber’s Consumer Management Framework - 2021 - 📚
Disaster Recovery for Multi-Region Kafka at Uber - 2020 - 📚
Kafka Cluster Federation at Uber - 2019 - 🎙️
Building Reliable Reprocessing and Dead Letter Queues with Apache Kafka - 2018 - 📚
Introducing Chaperone: How Uber Engineering Audits Apache Kafka End-to-End - 2016 - 📚
uReplicator: Uber Engineering’s Robust Apache Kafka Replicator - 2016 - 📚

Wise

Streaming Infrastructure at Wise - 2023 - 🎙️
Rack awareness in Kafka Streams - 2022 - 📚
Teamwork: Implementing a Kafka retry strategy at Wise - 2021 - 📚
Running Kafka in Kubernetes, Part 1: Why we migrated our Kafka clusters to Kubernetes. - 2021 - 📚
Running Kafka in Kubernetes, Part 2: How we migrated our Kafka clusters to Kubernetes. - 2021 - 📚
Securing Kafka with SPIFFE at TransferWise - Jonathan Oddy, Levani Kokhreidze - 2020 - 🎙️
Achieving high availability with stateful Kafka Streams applications - 2018 - 📚

Wix

4 Steps for Kafka Rebalance - Notes From the Field - 2021 - 📚
Wix’s Journey Into Data Streams - 2021 - 📚
Building a High-level SDK for Kafka: Greyhound Unleashed - 2020 - 📚

Yelp

Kafka on PaaSTA: Running Kafka on Kubernetes at Yelp (Part 2 - Migration) - 2022 - 📚
Kafka on PaaSTA: Running Kafka on Kubernetes at Yelp (Part 1 - Architecture) - 2021 - 📚
Streams and Monk – How Yelp is Approaching Kafka in 2020 - 2020 - 📚
Billions of Messages a Day – Yelp’s Real-time Data Pipeline - 2017 - 🎙️

Zalando

Rock Solid Kafka and ZooKeeper Ops on AWS - 2018 - 📚
Many-to-Many Relationships Using Kafka - 2018 - 📚
Event First Development - Moving Towards Kafka Pipeline Applications - 2017 - 📚
Reattaching Kafka EBS in AWS - 2017 - 📚
Real-time Ranking with Apache Kafka’s Streams API - 2017 - 📚
Running Kafka Streams applications in AWS - 2017 - 📚
A Recipe for Kafka Lag Monitoring - 2017 - 📚
Surviving Data Loss - 2017 - 📚

Zendesk

No Access Denied: Our Transition to Kafka ACLs - 2024 - 📚
Seamless Transition: Migrating Kafka Cluster to Kubernetes - 2024 - 📚
Kafka: Automating Root CA rotation with Vault - 2023 - 📚
Implementing mTLS and Securing Apache Kafka at Zendesk - 2021 - 📚
An investigation into Kafka Log Compaction - 2020 - 📚
Kafka on Ruby - 2020 - 📚
Create a test data generator using Kafka Connect - 2018 - 📚

Zopa Bank

Highly Available Kafka Consumers and Kafka Streams on Kubernetes - 2023 - 🎙️
