Describe the bug
There's an edge case where using repartition into more partitions than there are rows in the df results in empty partitions, which then cause a crash when a transformation is applied to the df.
To Reproduce
With Daft version 0.2.21, run the following script:
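(The original snippet isn't preserved here; a minimal sketch of this kind of repro, with illustrative column names and string values, could look like the following.)

```python
import daft

# 3 rows repartitioned into 4 partitions leaves one partition empty.
df = daft.from_pydict({"text": ["foo", "bar", "baz"]})
df = df.repartition(4)

# Applying a string transformation then fails on the empty partition.
df = df.with_column("text", daft.col("text").str.replace("o", "0"))
df.collect()
```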
It should result in a stack trace that culminates with:
ValueError: DaftError::ValueError Error in replace: Inputs have invalid lengths: 0, 1, 1
Note: this does not happen when using into_partitions. Both the following examples work as expected (i.e. the transformation is successfully applied to the entire df, no crash occurs):
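(The original examples aren't preserved here; a hypothetical equivalent using into_partitions, which per the report completes without error, might look like this.)

```python
import daft

df = daft.from_pydict({"text": ["foo", "bar", "baz"]})
df = df.into_partitions(4)
df = df.with_column("text", daft.col("text").str.replace("o", "0"))
df.collect()  # completes; the transformation is applied to all rows
```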
Expected behavior
Without a deep understanding of the Rust engine, I'd say repartitioning should never create empty partitions in the first place, as it is hard to imagine their use. I'd expect the repartition(4) call from the above example to silently create only 3 partitions.
Even if there's a reason for supporting empty partitions, there should be a mechanism for not attempting to perform the apply operation on these empty partitions, or at the very least for recognizing the issue ahead of execution and raising an exception indicating that the operation is expected to fail due to the existence of empty partitions.
repartition and into_partitions should have consistent behavior on this
Desktop (please complete the following information):
OS: Ubuntu 18.04
Daft Version: 0.2.21
To address your point about (2), this is actually specifically a bug in our string kernels! .str.replace() should absolutely work over an empty column (it just returns the empty result!) Fix: #2165
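(A hypothetical sanity check of the post-fix behavior, assuming an explicitly typed empty string column: .str.replace() should return an empty result rather than erroring.)

```python
import daft

empty = daft.from_pydict({"text": []}).with_column(
    "text", daft.col("text").cast(daft.DataType.string())
)
empty = empty.with_column("text", daft.col("text").str.replace("o", "0"))
print(empty.collect())  # expect: 0 rows, no error
```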
On your point about (1)
Without deep understanding of the rust engine, I'd say repartitioning should never create empty partitions in the first place, as it is hard to imagine their use.
The current query planner in Daft unfortunately still has some very strong assumptions about knowing exactly how many partitions each "stage" expects, so we don't yet have a way to dynamically change the number of partitions during execution time.
HOWEVER! 😁 We are actually building mechanisms to make partitioning a lot less burdensome on the user! Coming soon to Daft - "Adaptive Query Execution (AQE)", where Daft is able to pause execution to inspect metadata about the partitions (e.g. how many empty partitions, how imbalanced partitions are) to perform dynamic splitting/partition pruning without user input.
You should be able to test this behavior out within the next month or so... Feel free to pop by our Slack to ask us any questions!
OK, neat, and thank you!
I do have some lingering questions (such as why into_partitions and repartition display different behaviors about this) but I think I'll be asking them a bit later in Slack :)