This discussion is related to issues #624 and #607. I have been investigating the compaction process in the Rust library, specifically comparing it to the Java implementation using Spark. During this investigation, I noticed a difference in how the FileScanTask class is handled between the two implementations.
In the Java version, the FileScanTask includes:

- A `DataFile` object, which provides crucial information about partitions, the `specId`, and the content type. This information is necessary for the rewrite process in compaction. However, I am aware that @sdd previously raised a valid concern regarding the inclusion of this data in the `FileScanTask` (in issue refactor: Store DataFile in FileScanTask instead #607 (comment)).
- A `List<DeleteFile>`, which is used to remove the necessary rows from existing files.
I would like to explore the preferred approach for adding the necessary data to facilitate the implementation of compaction in the Rust library. Here are a few potential options I am considering:
1. Add `DataFile` and `List<DeleteFile>` fields to `FileScanTask`.
2. Propose a new API that returns a more informative version of `FileScanTask` (perhaps `FileScanPlan`?), which includes the required data but is not serializable.
3. Other possible solutions? I am open to suggestions on alternative approaches.
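To make option 2 concrete, here is a minimal Rust sketch of what a non-serializable `FileScanPlan` wrapper could look like. All type and field names here are illustrative assumptions for discussion, not iceberg-rust's actual API:

```rust
// Illustrative sketch only: these types approximate, but are not,
// iceberg-rust's real definitions.

/// Minimal stand-in for a data file's metadata (hypothetical).
#[derive(Clone, Debug)]
pub struct DataFile {
    pub file_path: String,
    pub partition_spec_id: i32,
    pub record_count: u64,
}

/// Minimal stand-in for a delete file reference (hypothetical).
#[derive(Clone, Debug)]
pub struct DeleteFile {
    pub file_path: String,
}

/// Today's task: just enough information to read the file (simplified).
#[derive(Clone, Debug)]
pub struct FileScanTask {
    pub data_file_path: String,
}

/// Option 2: a richer, non-serializable plan node that carries the
/// full `DataFile` plus the delete files that apply to it.
pub struct FileScanPlan {
    pub task: FileScanTask,
    pub data_file: DataFile,
    pub deletes: Vec<DeleteFile>,
}

impl FileScanPlan {
    /// A compaction driver can group plans by partition spec id
    /// before rewriting files.
    pub fn spec_id(&self) -> i32 {
        self.data_file.partition_spec_id
    }
}
```

The advantage of this shape is that `FileScanTask` stays lean and serializable for ordinary reads, while compaction gets the extra metadata through a separate type.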
I also tried to map out the logic of the Java + Spark implementation to help us understand the flow, in the hope that we can do the same with Rust and DataFusion, and maybe Comet.
Thanks @amitgilad3 for raising this. I think compaction is a relatively complex topic, and we are still somewhat far from implementing it. For example, we don't yet support reading delete files, and we don't have a transaction API. Also, compaction typically requires a distributed computing engine. I think a better approach would be to provide the necessary primitives in this library, and help other distributed engines do the actual compaction.
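One example of the kind of "primitive" this library could expose, while leaving the distributed rewrite to an engine like Spark or DataFusion, is planning rewrite groups: greedily packing small files into groups bounded by a target size, where each group is an independent unit of work. This is a hedged sketch under assumed types, not iceberg-rust's API:

```rust
// Hedged sketch: one possible compaction primitive. The `CandidateFile`
// type and `plan_rewrite_groups` function are illustrative assumptions.

pub struct CandidateFile {
    pub path: String,
    pub size_bytes: u64,
}

/// Greedily pack files into groups no larger than `target_bytes`.
/// Each returned group can be rewritten in parallel by a query engine;
/// the library only does the (cheap, metadata-only) planning.
pub fn plan_rewrite_groups(
    mut files: Vec<CandidateFile>,
    target_bytes: u64,
) -> Vec<Vec<CandidateFile>> {
    // Sort smallest-first so small files are merged together.
    files.sort_by_key(|f| f.size_bytes);

    let mut groups = Vec::new();
    let mut current: Vec<CandidateFile> = Vec::new();
    let mut current_size = 0u64;

    for f in files {
        // Close the current group when adding this file would exceed the target.
        if current_size + f.size_bytes > target_bytes && !current.is_empty() {
            groups.push(std::mem::take(&mut current));
            current_size = 0;
        }
        current_size += f.size_bytes;
        current.push(f);
    }
    if !current.is_empty() {
        groups.push(current);
    }
    groups
}
```

Splitting the work this way mirrors what Java's Spark actions do: the table library plans which files to rewrite together, and the engine executes the reads and writes.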
Would love to get your input @sdd @Xuanwo & @ZENOTME