- Paper: Ground is an open-source data context service, a system to manage all the information that informs the use of data
- Apache Hudi - Design Principles
- OpenTelemetry specification V1.0
- Getting started with Data Engineering, volume 5 🎉
- Getting started with Data Engineering, volume 4 🎉💡
- Getting started with Data Engineering, volume 3 🎉💡
- Getting started with Data Engineering, volume 2 🎉💡
- Getting started with Data Engineering, volume 1 🎉💡
- Apache Airflow 2.0
- Some Interesting essentials while learning Apache Airflow
- Dagster Release 0.10.0 - Everything about Exactly-once, Fault-Tolerant Scheduling - Extremely Important Release 🎉🎉🎉
- #getdbt or Data Build Tools interface across all major Data Workflow Management Platform 💯✨🔥
- Apache Superset - An #opensource Fully Featured Business Intelligence Application 🎊🎊🎊
- The Hop Orchestration Platform, or Apache #Hop (Incubating), aims to facilitate all aspects of data and metadata orchestration 💯💡⭐
- Apache Iceberg Partitioning is way better than Hive! Hidden Partitioning makes everything easier! 🎉
- Trino aka #prestosql is different from Apache Spark SQL - Exclusively designed for Distributed SQL 🎉
- Apache Spark is NOT a Map Reduce but an MPP/MPI Engine
- DataEngg Skills to work with DataScience
- Data Quality, A necessity for Data Driven Projects
- Essential Cloud Skills for Data Engineering
- Open Source Technologies in Data Engineering
- Kubernetes Fundamentals Required as a Data Engineer
- Apache Superset, OSS Business Intelligence for 2021
- #apachekafka as a Database - Summary of both sides, Arguments, Trade-offs & exceptional 💬 quotes ⏳💡⏳
- Processing Guarantees in #apachekafka 💯🔆🎉 - The best resource
- Change Data Analysis with Debezium and Apache Pinot 🎉💡🚿
- Optimizing Apache Kafka Producers & Consumers 🎊📈🎉
- Redpanda - A NON-JVM Streaming Platform for mission-critical workloads 💡🎉🔆
- Apache Hudi - Turn Batch Jobs to Incremental Model | Complete file management on a Data Lake
- Apache Iceberg - an open table format for huge analytic datasets
- Ballista - Distributed computing platform built primarily on Rust and powered by Apache Arrow
- ZooKeeper, a distributed, open-source coordination service for distributed applications
- Apache Iceberg - Partition Evolution, it's simple but it's so amazing
- A Data Engineering Story - The Beginning
- Data Engineering - More towards Data Science or Data Analytics or ...
- Data Engineering Interview Patterns
- Basic Checklists while learning Apache Spark
- #apachespark for Distributed Analytics or #businessinteligence Platform - Worth it or not?
- Apache Beam for Search: An Introduction & Addressing the challenge of the Time Problem 🔐💡🔒
- Nextflow is a Workflow Manager exclusively for #bioinformatics 🩹💊🩹
- #apachespark Project Zen Update - Making PySpark Better 💡🔗💡
- Design - Exactly Once Delivery & Transactional Messaging in #apachekafka 🎊📋🎊
- SQL Database on Kubernetes - Best Practices
- Devtron - An Open Source DevOps on Kubernetes, written in Go 🥇🎁🎉
- Most Popular #opensource BI & Data Analytics Platforms 🎊💡🎉
- datapipelines DataFrame API is now available with #apachebeam 💯🔥💯
- Disaster Recovery for Multi-Region Apache Kafka & Data Consumption using #apacheflink 🔅🎉🔅
- Kubernetes Api Structure 💯✔️💯
- Architecting a Kubernetes Infrastructure 💯
- Exploring Kubernetes Operator Pattern 💡
- Docker is an integral part of Data Engineering ML pipelines, and that makes security 🔐 extremely essential
- Machine Learning Workflow 💯
- Dummy Notes On Machine Learning Infrastructure
- Machine Learning Feature Store 💯
- Deploying #machinelearning model in Production is really HARD but #MLOps can fix that.
- List of #machinelearning & #dataengineering Technologies I will be following in 2021 🎉💡🎉
- MLOps - ZenML #machinelearning with reproducible pipelines ✅💯✅
- Streamlit Healthcare Machine Learning Data App
- Dstack AI - An open-source tool to develop data applications with Python 🎉💭🎊
- Adversarial Robustness Toolbox - a Python library for #machinelearning Security 💡🔎🔓
- Biopython is a set of freely available tools for biological computation written in #Python 💊⌛️💊
- Time to Know More about DASK
- DataEngineering vs Machine Learning
- A good #machinelearning Model is only possible with a good quality of #data. ⌛️
- Statistics for #softwareengineer 🔥💯🔥
- Monitoring #machinelearning Applications 🎉🎁 🎉
- Dagster is a data orchestrator for machine learning, analytics, and ETL - Officially #machinelearning driven 🥇🥇🥇
- Short Notes on Open Source #machinelearning Tracking System
- The best example of Randomness is - #machinelearning model in Production. 🔐💭🔎
- Most important points around Distributed #dataengineering Platform
- Fundamental of #distributedsystems Scaling - Avoiding Co-ordination 🎊♨️🔆
- Paper on Wander Join: Online Aggregation via Random Walks 📃💭📑 Join problem
- The Delta Lake Paper - High-Performance ACID Table Storage 📋💡📋
- Dynamo - AWS Highly Available Key-value Store #distributedsystem 💬💡🎉
- Lakehouse - A Paper on new Generation of #datawarehouse technology 💡🔎💡
- Calvin: Fast Distributed Transactions for Partitioned Database Systems 📝📝
- Presto or Trino - #SQL on Everything ( The Design, Motivation & Performance) #presto 💭🎊💡
- Design - Exactly Once Delivery & Transactional Messaging in Apache Kafka
- Apache Kafka Paper : Distributed Messaging System for Log Processing