[Bug] [Dask-on-Ray] Task-based shuffle not being inferred from setting Dask-on-Ray scheduler. #20992
Closed
1 of 2 tasks
Labels
Milestone
Search before asking
Ray Component
Ray Core, Others
What happened + What you expected to happen
When using Ray Cluster (not local ray) with at least 1 worker and setting an index on Dask Dataframe results in uneven distribution. Depending on the index-column structure some partitions may be missing completely (length=0).
Groupby + apply operation on Dask DF with set index returns incomplete results. A varying number of groups are missing per run, so with each run, we get a Dask DF of a different size.
I would expect Ray to handle such a case correct, as it is a scheduler and in Dask docs of
set_index
it states:Apparently supplying
dask.config.set(scheduler=ray_dask_get)
orset_index(...).compute(scheduler=ray_dask_get)
does not work and the appropriate shuffle setting is not infered as expected.
Versions / Dependencies
python_version='3.7.7'
ray_version='1.8.0'
dask_version='2021.9.1'
Reproduction script
Anything else
This problem does not occur when:
ray.init()
)shuffle="tasks"
inset_index
methodI tested this case in Ray==1.8.0 + Dask==2021.9.1 and Ray==1.9.0 + Dask=2021.11.0. Both for Client and Cluster side.
#20108 is the issue that seems to be related somehow to my case.
Are you willing to submit a PR?
The text was updated successfully, but these errors were encountered: