Machine Learning, Online, Streaming, Kafka Streams, Kotlin, JVM
I encourage everyone to see a presentation first.
- pp-docs - a solution blueprints and other presentation assets
- pp-infrastructure - a simple gradle build / docker-compose to spin up an infrastructure including dumped measurements
- pp-data-preparation - a set of streams to clean-up and prepare data
- pp-pollution-predictor - a set of streams for data clustering
- pp-data-loader - a simple tool that reads input (MongoDb) data and load it to kafka topics
- pp-generator - a simple tool to generate load and send it to kafka topics
- presentation-devoxx_2021-08-25.pdf - a presentation
A project was created to show a practical end-to-end example how to deal with online machine learning problems (using Kafka Streams). Its main purpose is to predict pollution level based on historical data and current weather conditions.
A main idea is to compare 'old' predictions (24h ahead) with current measurements. That is an input for clusterization. Date models (one per each kafka key) are dynamic, and they change in a streaming way. The beauty of this solution is that we do not need to store all historical measurements. As we may read in a paper: 'Fast and accurate k-means', it is enough to just persist coordinates of a cluster. I used for that Kafka Streams' state stores, which are scalable and durable. In order to achieve this I had to create a serializable custom implementation of that algorithm (a fork of Mahout's one).
A solution is quasi-online. We create statistics (e.g. average, standard deviation, correlation) in a streaming fashion, however after a training period passes we use their fixed snapshot for data normalization. 'A switch' is dynamic and was achieved using a custom Kafka Streams' window store.
A code is organized around two main JVMs: pp-data-preparation and pp-pollution-predictor. Those are Spring Boot apps combined with Spring Cloud Stream framework. Each app consists of several Kafka Streams topologies.
A high level diagram shows a data flow. Arrows represent producers and consumers:
pp-infrastructure> ./gradlew start
pp-generator> ./gradlew bootRun # or pp-data-loader>
pp-data-preparation> ./gradlew bootRun
pp-pollution-predictor> ./gradlew bootRun
- 8+ GB RAM
- recommended Linux platform
- 7zip, mongodb-tools on a PATH