How to determine the freshness of data in streaming data lake tables? #2483
zhoujinsong started this conversation in General
Currently, many users in the community use compute engines such as Flink to write data to data lake tables in streaming mode. Ideally, the data in the data lake should always remain fresh. In practice, however, factors such as the availability of the data source and the availability and performance of the ingestion jobs can cause the data in the tables to lag behind.
For users and managers of these tables, being able to observe the freshness of the data is crucial. On one hand, when a table has freshness issues, they need to be investigated and repaired promptly. On the other hand, readers of the table often cannot satisfy their business needs until the data freshness has caught up.
Accurately calculating and displaying the freshness of the data in the table is therefore the key to solving this problem.
The last commit time of the table can certainly indicate whether the streaming write job is still functioning properly, but it cannot accurately reflect the freshness of the data when the source itself is delayed or the write job has accumulated a backlog.
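For reference, here is a minimal sketch of the commit-time approach using the Iceberg Java API (`currentSnapshot()` and `timestampMillis()` are real Iceberg APIs; the `CommitLag` helper and the idea of treating commit lag as a staleness signal are illustrative, not an Iceberg feature):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;

/** Illustrative staleness signal based on an Iceberg table's last commit. */
public final class CommitLag {

    /**
     * Returns how long ago the current snapshot was committed, or empty
     * if the table has no snapshots yet. This only proves the writer is
     * alive; it says nothing about source delay or writer backlog.
     */
    static Optional<Duration> sinceLastCommit(Table table) {
        return Optional.ofNullable(table.currentSnapshot())
            .map(Snapshot::timestampMillis)
            .map(ms -> Duration.between(Instant.ofEpochMilli(ms), Instant.now()));
    }
}
```

A monitoring job could alert when this duration exceeds a threshold, but as noted above, a recent commit does not guarantee that the committed data is itself recent.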
The concept of "watermark" in the streaming computing field is more suitable as a representation of the freshness of data in the table. A watermark is a timestamp representing the point in time before which all data has been written to the table. However, due to data being out of order, accurately calculating the watermark in the table is also not easy. Additionally, there are differences in the current support for watermarks among various data lake table formats. Paimon supports automatic writing of the watermark into Paimon tables based on the watermark definition of Flink jobs, while Iceberg tables currently do not have the watermark property.
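To make the Paimon path concrete, here is a minimal sketch in Flink's Java Table API, assuming a hypothetical Kafka topic `orders` and a local Paimon warehouse path (all table names, fields, and connector options are illustrative):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public final class PaimonWatermarkIngestion {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Hypothetical Kafka source; the WATERMARK clause tells Flink that
        // events may arrive up to 5 seconds out of order.
        tEnv.executeSql(
            "CREATE TABLE orders_src ("
                + "  order_id BIGINT,"
                + "  ts TIMESTAMP(3),"
                + "  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND"
                + ") WITH ("
                + "  'connector' = 'kafka',"
                + "  'topic' = 'orders',"
                + "  'properties.bootstrap.servers' = 'broker:9092',"
                + "  'scan.startup.mode' = 'latest-offset',"
                + "  'format' = 'json'"
                + ")");

        // Paimon catalog backed by a local warehouse path (illustrative).
        tEnv.executeSql(
            "CREATE CATALOG paimon_cat WITH ("
                + "  'type' = 'paimon',"
                + "  'warehouse' = 'file:/tmp/paimon'"
                + ")");
        tEnv.executeSql(
            "CREATE TABLE IF NOT EXISTS paimon_cat.`default`.orders ("
                + "  order_id BIGINT,"
                + "  ts TIMESTAMP(3)"
                + ")");

        // Streaming ingestion; Paimon records the job's watermark in the
        // snapshots it commits, so freshness can later be derived as
        // (current time - snapshot watermark).
        tEnv.executeSql(
            "INSERT INTO paimon_cat.`default`.orders "
                + "SELECT order_id, ts FROM orders_src");
    }
}
```

The persisted watermark can then be inspected through Paimon's snapshots system table (e.g. querying the `watermark` column of `orders$snapshots`), which gives a freshness signal that, unlike the last commit time, accounts for out-of-order and delayed source data.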
I would like to use this discussion to gather everyone's actual requirements for data freshness in streaming data lake scenarios. Based on those needs, we can then discuss how to calculate and display the freshness of the data in the table more accurately.