Fix: repartition to more partitions than rows can cause crushes #2164

EtayLivne · 2024-04-21T08:11:38Z

Describe the bug
There's an edge case where using repartition into more partitions than there are rows in the df will results in empty partitions, which will then crush when a transformation is applied on the df.

To Reproduce
With Daft version 0.2.21, run the following script:

df = daft.from_pydict({"data": ["foo", "bar", "baz"]}).repartition(4)
df.with_column("replace", df["data"].str.replace("ba", "123")).collect()

It should result in a stack trace that culminates with:

ValueError: DaftError::ValueError Error in replace: Inputs have invalid lengths: 0, 1, 1

Note: this does not happen when using into_partitions. Both the following examples work as expected (i.e transformation is succesfully applied to entire df, no crush occurs):

# Example 1
df = daft.from_pydict({"data": ["foo", "bar", "baz"]}).into_partitions(4)
df.with_column("replace", df["data"].str.replace("ba", "123")).collect()

# Example 2
df = daft.from_pydict({"data": ["foo", "bar", "baz"]}).repartition(4).into_partitions(4)
df.with_column("replace", df["data"].str.replace("ba", "123")).collect()

Expected behavior

Without deep understanding of the rust engine, I'd say repartitioning should never create empty partitions in the first place, as it is hard to imagine their use. I'd expect the repartition(4) call from the above example to silently create only 3 partitions.
Even if there's a reason for supporting empty partitions, a mechanism for not attempting to perform the apply operation on these empty partitions, or at the very least recognize the issue ahead of execution and raise an exception that the operation is expected to fail due to he existence of empty partitions.
repartition and into_partitions should have consistent behavior on this

Desktop (please complete the following information):

OS: Ubuntu 18.04
Daft Version: 0.2.21

The text was updated successfully, but these errors were encountered:

jaychia · 2024-04-21T20:43:32Z

Thanks for the issue @EtayLivne !

To address your point about (2), this is actually specifically a bug over our string kernels! .str.replace() should absolutely work over an empty column (it just returns the empty result!) Fix: #2165

On your point about (1)

Without deep understanding of the rust engine, I'd say repartitioning should never create empty partitions in the first place, as it is hard to imagine their use.

The current query planner in Daft unfortunately still has some very strong assumptions about knowing exactly how many partitions each "stage" expects, so we don't yet have a way to dynamically change the number of partitions during execution time.

HOWEVER! 😁 We are actually building mechanisms to make partitioning a lot less burdensome on the user! Coming soon to Daft - "Adaptive Query Execution (AQE)", where Daft is able to pause execution to inspect metadata about the partitions (e.g. how many empty partitions, how imbalanced partitions are) to perform dynamic splitting/partition pruning without user input.

You should be able to test this behavior out within the next month or so... Feel free to pop by our Slack to ask us any questions!

EtayLivne · 2024-04-22T08:00:27Z

OK, neat, and thank you!
I do have some lingering questions (such as why into_partitions and repartition display different behaviors about this) but I think I'll be asking them a bit later in Slack :)

EtayLivne closed this as completed Apr 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix: repartition to more partitions than rows can cause crushes #2164

Fix: repartition to more partitions than rows can cause crushes #2164

EtayLivne commented Apr 21, 2024

jaychia commented Apr 21, 2024

EtayLivne commented Apr 22, 2024

Fix: repartition to more partitions than rows can cause crushes #2164

Fix: repartition to more partitions than rows can cause crushes #2164

Comments

EtayLivne commented Apr 21, 2024

jaychia commented Apr 21, 2024

EtayLivne commented Apr 22, 2024