Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: repartition to more partitions than rows can cause crushes #2164

Closed
EtayLivne opened this issue Apr 21, 2024 · 2 comments
Closed

Fix: repartition to more partitions than rows can cause crushes #2164

EtayLivne opened this issue Apr 21, 2024 · 2 comments

Comments

@EtayLivne
Copy link

Describe the bug
There's an edge case where using repartition into more partitions than there are rows in the df will results in empty partitions, which will then crush when a transformation is applied on the df.

To Reproduce
With Daft version 0.2.21, run the following script:

df = daft.from_pydict({"data": ["foo", "bar", "baz"]}).repartition(4)
df.with_column("replace", df["data"].str.replace("ba", "123")).collect()

It should result in a stack trace that culminates with:

ValueError: DaftError::ValueError Error in replace: Inputs have invalid lengths: 0, 1, 1

Note: this does not happen when using into_partitions. Both the following examples work as expected (i.e transformation is succesfully applied to entire df, no crush occurs):

# Example 1
df = daft.from_pydict({"data": ["foo", "bar", "baz"]}).into_partitions(4)
df.with_column("replace", df["data"].str.replace("ba", "123")).collect()

# Example 2
df = daft.from_pydict({"data": ["foo", "bar", "baz"]}).repartition(4).into_partitions(4)
df.with_column("replace", df["data"].str.replace("ba", "123")).collect()

Expected behavior

  1. Without deep understanding of the rust engine, I'd say repartitioning should never create empty partitions in the first place, as it is hard to imagine their use. I'd expect the repartition(4) call from the above example to silently create only 3 partitions.
  2. Even if there's a reason for supporting empty partitions, a mechanism for not attempting to perform the apply operation on these empty partitions, or at the very least recognize the issue ahead of execution and raise an exception that the operation is expected to fail due to he existence of empty partitions.
  3. repartition and into_partitions should have consistent behavior on this

Desktop (please complete the following information):

  • OS: Ubuntu 18.04
  • Daft Version: 0.2.21
@jaychia
Copy link
Contributor

jaychia commented Apr 21, 2024

Thanks for the issue @EtayLivne !

To address your point about (2), this is actually specifically a bug over our string kernels! .str.replace() should absolutely work over an empty column (it just returns the empty result!) Fix: #2165

On your point about (1)

Without deep understanding of the rust engine, I'd say repartitioning should never create empty partitions in the first place, as it is hard to imagine their use.

The current query planner in Daft unfortunately still has some very strong assumptions about knowing exactly how many partitions each "stage" expects, so we don't yet have a way to dynamically change the number of partitions during execution time.

HOWEVER! 😁 We are actually building mechanisms to make partitioning a lot less burdensome on the user! Coming soon to Daft - "Adaptive Query Execution (AQE)", where Daft is able to pause execution to inspect metadata about the partitions (e.g. how many empty partitions, how imbalanced partitions are) to perform dynamic splitting/partition pruning without user input.

You should be able to test this behavior out within the next month or so... Feel free to pop by our Slack to ask us any questions!

@EtayLivne
Copy link
Author

OK, neat, and thank you!
I do have some lingering questions (such as why into_partitions and repartition display different behaviors about this) but I think I'll be asking them a bit later in Slack :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants