[PERF] enable metadata preservation across materialization points #2216

samster25 · 2024-05-02T07:02:12Z

Enabling AQE introduces intermediate materializations for queries with multiple shuffles.
The problem is that metadata is not preserved across materialization boundaries.
For example, if we run a SortMergeJoin and draw a boundary after the sort and before the join, the algorithm errors out because the boundaries value is not set on the MaterializedResult.
This happens because at the .collect() point we place MicroPartitions into the cache rather than the MaterializedResult, which contains both the data and the PartitionMetadata.

We already do this for the Ray runner; this PR formalizes it for all runners.
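
To illustrate the failure mode described above, here is a minimal sketch. It does not use Daft's actual classes: `SketchMetadata`, `SketchResult`, `cache_data_only`, and `cache_full_results` are hypothetical stand-ins, and partitions are modeled as plain lists. It only shows why caching the bare data loses values such as the sort boundaries, while caching the full result keeps them.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class SketchMetadata:
    # Hypothetical stand-in for partition metadata.
    num_rows: int
    boundaries: Optional[List[int]] = None  # e.g. set by a sort step


@dataclass(frozen=True)
class SketchResult:
    # Hypothetical stand-in for a materialized result: data plus metadata.
    rows: List[int]
    metadata: SketchMetadata


def cache_data_only(results: List[SketchResult]) -> Dict[int, List[int]]:
    # Caches only the data; the metadata is dropped at the boundary.
    return {i: r.rows for i, r in enumerate(results)}


def cache_full_results(results: List[SketchResult]) -> Dict[int, SketchResult]:
    # Caches the whole result, so the metadata survives the boundary.
    return {i: r for i, r in enumerate(results)}


sorted_parts = [
    SketchResult([1, 2, 3], SketchMetadata(num_rows=3, boundaries=[3])),
    SketchResult([4, 5], SketchMetadata(num_rows=2, boundaries=[5])),
]

lossy = cache_data_only(sorted_parts)
kept = cache_full_results(sorted_parts)
```

With `lossy`, a downstream step has no way to read `boundaries`; with `kept`, it is still available as `kept[i].metadata.boundaries`.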

jaychia

LGTM overall

jaychia · 2024-05-02T19:24:53Z

daft/dataframe/dataframe.py

@@ -314,7 +316,10 @@ def _from_tables(cls, *parts: MicroPartition) -> "DataFrame":
        if not parts:
            raise ValueError("Can't create a DataFrame from an empty list of tables.")

-        result_pset = LocalPartitionSet({i: part for i, part in enumerate(parts)})
+        result_pset = LocalPartitionSet()


Should we have a LocalPartitionSet.from_tables()?
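
A rough sketch of what such a constructor could look like, assuming a partition set that stores result objects keyed by index. The class and method names here (`SketchPartitionSet`, `SketchResult`, `set_partition`, `get_partition`) are hypothetical and only mirror the shape of the suggestion, not the real `LocalPartitionSet` API.

```python
from typing import Dict, List


class SketchResult:
    # Hypothetical stand-in: wraps the data and records its row count.
    def __init__(self, rows: List[int]) -> None:
        self.rows = rows
        self.num_rows = len(rows)


class SketchPartitionSet:
    # Hypothetical stand-in for a local partition set.
    def __init__(self) -> None:
        self._partitions: Dict[int, SketchResult] = {}

    def set_partition(self, idx: int, result: SketchResult) -> None:
        self._partitions[idx] = result

    def get_partition(self, idx: int) -> SketchResult:
        return self._partitions[idx]

    def num_partitions(self) -> int:
        return len(self._partitions)

    @classmethod
    def from_tables(cls, *parts: List[int]) -> "SketchPartitionSet":
        # Wrap each raw partition in a result object so callers
        # do not have to repeat the loop at every call site.
        pset = cls()
        for i, part in enumerate(parts):
            pset.set_partition(i, SketchResult(part))
        return pset


pset = SketchPartitionSet.from_tables([1, 2, 3], [4, 5])
```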

jaychia · 2024-05-02T19:28:17Z

daft/runners/pyrunner.py

@@ -67,10 +75,10 @@ def has_partition(self, idx: PartID) -> bool:
        return idx in self._partitions

    def __len__(self) -> int:
-        return sum(len(partition) for partition in self._partitions.values())
+        return sum(len(partition.partition()) for partition in self._partitions.values())


Interesting. After this PR we actually have metadata, and don't necessarily need to reach for the partition to get the length...

Would it not be possible/safe to let MaterializedResult.__len__ delegate appropriately between the metadata and the partition to get the length of the partition?

I guess it doesn't really matter given that this is a local MicroPartition though
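
For reference, the delegation could be sketched roughly as follows. This is an assumption-laden illustration: `SketchMetadata` and `SketchResult` are hypothetical, and it assumes the metadata may or may not carry a row count, which may not match the real `PartitionMetadata`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SketchMetadata:
    # Hypothetical: the row count may be unknown.
    num_rows: Optional[int] = None


class SketchResult:
    # Hypothetical stand-in for a materialized result.
    def __init__(self, rows: List[int], metadata: SketchMetadata) -> None:
        self._rows = rows
        self._metadata = metadata

    def partition(self) -> List[int]:
        return self._rows

    def metadata(self) -> SketchMetadata:
        return self._metadata

    def __len__(self) -> int:
        # Prefer the metadata when it knows the row count;
        # otherwise fall back to the partition itself.
        n = self._metadata.num_rows
        if n is not None:
            return n
        return len(self._rows)


results = [
    SketchResult([1, 2, 3], SketchMetadata(num_rows=3)),
    SketchResult([4, 5], SketchMetadata()),
]
total_rows = sum(len(r) for r in results)
```

The partition set's `__len__` would then reduce to `sum(len(r) for r in ...)` without reaching into `.partition()` directly.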

jaychia · 2024-05-02T19:31:47Z

daft/runners/pyrunner.py

+    def build_partitions(
+        instruction_stack: list[Instruction],
+        partitions: list[MicroPartition],
+        final_metadata: list[PartialPartitionMetadata],


Nit: we could enforce same length using partitions: list[tuple[MicroPartition, PartialPartitionMetadata]]
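
A small sketch of the difference, using plain lists and dicts as hypothetical stand-ins for the partition and metadata types (`build_separate` and `build_paired` are illustrative names, not functions in the PR). With two parallel lists the length check has to happen at runtime; with a list of pairs a mismatch cannot be expressed at all.

```python
from typing import Dict, List, Tuple

Partition = List[int]        # hypothetical stand-in for a partition
Metadata = Dict[str, int]    # hypothetical stand-in for partial metadata


def build_separate(
    partitions: List[Partition], final_metadata: List[Metadata]
) -> List[Tuple[Partition, Metadata]]:
    # Two parallel lists: the caller can pass mismatched lengths,
    # so the check must be done explicitly.
    if len(partitions) != len(final_metadata):
        raise ValueError("partitions and final_metadata differ in length")
    return list(zip(partitions, final_metadata))


def build_paired(
    partitions: List[Tuple[Partition, Metadata]]
) -> List[Tuple[Partition, Metadata]]:
    # One list of pairs: every partition comes with its metadata
    # by construction, so no length check is needed.
    return list(partitions)


paired = build_paired([([1, 2], {"num_rows": 2}), ([3], {"num_rows": 1})])
```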

jaychia · 2024-05-02T19:33:46Z

daft/runners/ray_runner.py

@@ -266,9 +269,12 @@ def partition_set_from_ray_dataset(
        daft_vpartitions = [
            _make_daft_partition_from_ray_dataset_blocks.remote(block, daft_schema) for block in block_refs
        ]
+        pset = RayPartitionSet()


RayPartitionSet.from_ray_materialized_results might be nice

jaychia · 2024-05-02T19:39:20Z

daft/runners/ray_runner.py

@@ -536,8 +547,25 @@ def place_in_queue(item):
                            elif len(next_step.instructions) == 0:
                                logger.debug("Running task synchronously in main thread: %s", next_step)
                                assert isinstance(next_step, SingleOutputPartitionTask)
+                                [single_partial] = next_step.partial_metadatas


This seems new - why was this necessary when before we didn't need to ensure that num_rows is available?

[PERF] enable metadata preservation across materialization points

067ef53

github-actions bot added the performance label May 2, 2024

rebase first

6933a43

samster25 requested review from clarkzinzow and jaychia May 2, 2024 07:17

jaychia approved these changes May 2, 2024

samster25 merged commit 24d0831 into main May 2, 2024
29 checks passed

samster25 deleted the sammy/enable-metadata-across-materialization-points branch May 2, 2024 19:44
