This discussion is related to issues #624 and #607. I have been investigating the compaction process in the Rust library, specifically comparing it to the Java implementation using Spark. During this investigation, I noticed a difference in how the FileScanTask class is handled between the two implementations.
In the Java version, the FileScanTask includes:

- A `DataFile` object, which provides crucial information about partitions, the `specId`, and the content type. This information is necessary for the rewrite process in compaction. However, I am aware that @sdd previously raised a valid concern regarding the inclusion of this data in the `FileScanTask` (in issue refactor: Store DataFile in FileScanTask instead #607 (comment)).
- A `List<DeleteFile>`, which is used to remove the necessary rows from existing files.
I would like to explore the preferred approach for adding the necessary data to facilitate the implementation of compaction in the Rust library. Here are a few potential options I am considering:
1. Add `DataFile` and `List<DeleteFile>` fields to `FileScanTask`.
2. Propose a new API that returns a more informative version of `FileScanTask` (perhaps `FileScanPlan`?), which includes the required data but is not serializable.
3. Other possible solutions? I am open to suggestions on alternative approaches.
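To make option 2 concrete, here is a minimal Rust sketch of what a non-serializable `FileScanPlan` wrapper could look like. All type and field names here are illustrative assumptions for discussion, not iceberg-rust's actual API:

```rust
// Illustrative sketch only: these types approximate, but are not,
// iceberg-rust's real definitions.

/// Minimal stand-in for a data file's metadata (hypothetical).
#[derive(Clone, Debug)]
pub struct DataFile {
    pub file_path: String,
    pub partition_spec_id: i32,
    pub record_count: u64,
}

/// Minimal stand-in for a delete file reference (hypothetical).
#[derive(Clone, Debug)]
pub struct DeleteFile {
    pub file_path: String,
}

/// Today's task: just enough information to read the file (simplified).
#[derive(Clone, Debug)]
pub struct FileScanTask {
    pub data_file_path: String,
}

/// Option 2: a richer, non-serializable plan node that carries the
/// full `DataFile` plus the delete files that apply to it.
pub struct FileScanPlan {
    pub task: FileScanTask,
    pub data_file: DataFile,
    pub deletes: Vec<DeleteFile>,
}

impl FileScanPlan {
    /// A compaction driver can group plans by partition spec id
    /// before rewriting files.
    pub fn spec_id(&self) -> i32 {
        self.data_file.partition_spec_id
    }
}
```

The advantage of this shape is that `FileScanTask` stays lean and serializable for ordinary reads, while compaction gets the extra metadata through a separate type.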
I also tried to map out the logic of the Java + Spark implementation to help us understand the flow, in the hope that we can do the same with Rust and DataFusion, and maybe Comet.
Thanks @amitgilad3 for raising this. I think compaction is a relatively complex topic, and we are still somewhat far from implementing it. For example, we don't yet support reading delete files, and we don't have a transaction API. Also, compaction typically requires a distributed computing engine. I think a better approach would be to provide the necessary primitives in this library, and help other distributed engines do the actual compaction.
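One example of the kind of "primitive" this library could expose, while leaving the distributed rewrite to an engine like Spark or DataFusion, is planning rewrite groups: greedily packing small files into groups bounded by a target size, where each group is an independent unit of work. This is a hedged sketch under assumed types, not iceberg-rust's API:

```rust
// Hedged sketch: one possible compaction primitive. The `CandidateFile`
// type and `plan_rewrite_groups` function are illustrative assumptions.

pub struct CandidateFile {
    pub path: String,
    pub size_bytes: u64,
}

/// Greedily pack files into groups no larger than `target_bytes`.
/// Each returned group can be rewritten in parallel by a query engine;
/// the library only does the (cheap, metadata-only) planning.
pub fn plan_rewrite_groups(
    mut files: Vec<CandidateFile>,
    target_bytes: u64,
) -> Vec<Vec<CandidateFile>> {
    // Sort smallest-first so small files are merged together.
    files.sort_by_key(|f| f.size_bytes);

    let mut groups = Vec::new();
    let mut current: Vec<CandidateFile> = Vec::new();
    let mut current_size = 0u64;

    for f in files {
        // Close the current group when adding this file would exceed the target.
        if current_size + f.size_bytes > target_bytes && !current.is_empty() {
            groups.push(std::mem::take(&mut current));
            current_size = 0;
        }
        current_size += f.size_bytes;
        current.push(f);
    }
    if !current.is_empty() {
        groups.push(current);
    }
    groups
}
```

Splitting the work this way mirrors what Java's Spark actions do: the table library plans which files to rewrite together, and the engine executes the reads and writes.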
Would love to get your input @sdd @Xuanwo & @ZENOTME