
[Bug] [Dask-on-Ray] Task-based shuffle not being inferred from setting Dask-on-Ray scheduler. #20992

Closed · mikwieczorek opened this issue Dec 9, 2021 · 1 comment · Fixed by #21114
Labels: bug (Something that is supposed to be working; but isn't) · P1 (Issue that should be fixed within a few weeks)
Assignee: clarkzinzow · Milestone: Core Backlog

mikwieczorek commented Dec 9, 2021

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Core, Others

What happened + What you expected to happen

When using a Ray cluster (not local Ray) with at least one worker node, setting an index on a Dask DataFrame results in an uneven partition distribution. Depending on the structure of the index column, some partitions may end up completely empty (length 0).

A groupby + apply operation on a Dask DataFrame with a set index then returns incomplete results: a varying number of groups is missing per run, so each run produces a Dask DataFrame of a different size.

I would expect Ray to handle this case correctly, since it acts as the scheduler, and the Dask docs for set_index state:

shuffle: string, ‘disk’ or ‘tasks’, optional
    Either 'disk' for single-node operation or 'tasks' for distributed operation. Will be inferred by your current scheduler.

Apparently, supplying dask.config.set(scheduler=ray_dask_get) or set_index(...).compute(scheduler=ray_dask_get) does not work: the appropriate shuffle setting is not inferred as expected.
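For comparison, the inference does kick in for dask.distributed; a minimal sketch (assuming a local dask.distributed Client, which is not part of the original report):

import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # the distributed scheduler becomes Dask's default

pdf = pd.DataFrame({"id": [2, 1, 2, 3], "value": [10, 20, 30, 40]})
ddf = dd.from_pandas(pdf, npartitions=2)

# Here set_index() infers shuffle="tasks" because the default scheduler is
# recognized as distributed; with scheduler=ray_dask_get the same inference
# does not happen and Dask falls back to the single-node disk shuffle.
ddf = ddf.set_index("id")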

Versions / Dependencies

python_version='3.7.7'
ray_version='1.8.0'
dask_version='2021.9.1'

Reproduction script

import dask
import dask.dataframe as dd
import numpy as np
import pandas as pd
import ray
from ray.util.dask import ray_dask_get


### Helpers
def create_df(num_unique_ids, max_num_samples, min_id, max_id):
    repetition_array = np.random.randint(1, max_num_samples, num_unique_ids)
    ids_array = np.repeat(np.random.randint(min_id, max_id, num_unique_ids), repetition_array).flatten()
    df = pd.DataFrame(
        {
            'id':  ids_array,
            'value': np.random.randint(0, max_num_samples, ids_array.size),
        }
    )
    # debug: print(ids_array.shape, repetition_array.shape)
    return df

def my_partition_size(df):
    # Number of rows in a single partition.
    return len(df)

def custom_func(partition, column_names):
    # For the first column listed in column_names, return the set of its
    # stringified values (whitespace replaced by underscores).
    for col in partition.columns:
        if col in column_names:
            values = partition[col].to_numpy()
            row_sequence = set(['_'.join(str(v).split()) for v in values])
            return row_sequence

def main():
    ### RAY
    # ray.init(address='auto')  ## When run inside a Ray Cluster
    # ray.init(address='ray://127.0.0.1:10001')  ## When run with port-forwarding from k8s
    ray.init(address='ray://172.17.0.20:10001')  ## When run with Ray Cluster Launcher
    # ray.init()  ## Local Ray

    ### DASK
    dask.config.set(scheduler=ray_dask_get)
    # print(dask.config.get("scheduler", None))

    grouping_column = 'id'
    allowed_column_ids = ['id', 'value']

    # Code
    df = create_df(num_unique_ids=2000, max_num_samples=1000, min_id=1, max_id=2000)
    df['id'] = df['id'].astype(np.int64)
    df['value'] = df['value'].astype(object)

    correct_number_of_groups = df['id'].nunique()
    print("Correct number of groups in a generated DataFrame: ", correct_number_of_groups)

    ddf = dd.from_pandas(df, npartitions=4)
    print(f"Paritions before setting index:\n",ddf.map_partitions(myparitionsize).compute())
    ddf = ddf.set_index(grouping_column, sorted=False)                    #  <- wrong results
    # ddf = ddf.set_index(grouping_column, sorted=False, shuffle="tasks") #  <- correct results
    print(f"Paritions after setting index:\n",ddf.map_partitions(myparitionsize).compute())

    d = ddf.groupby(grouping_column).apply(custom_func,
                                        column_names = allowed_column_ids,
                                        meta=object
                                        ).compute(scheduler=ray_dask_get)
    # The number of returned unique groups should equal correct_number_of_groups calculated on the Pandas DF.
    assert correct_number_of_groups == d.reset_index()['id'].nunique(), f"Correct number of groups: {correct_number_of_groups} NOT EQUAL number of groups returned {d.reset_index()['id'].nunique()}"
    
if __name__ == "__main__":
    main()

Anything else

This problem does not occur when:

  • Using local Ray (ray.init())
  • Explicitly setting shuffle="tasks" in the set_index call (see the snippet after this list)
  • Using Ray Cluster with a single node (head-node).
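
The second workaround can be applied per call (as in the commented-out line of the reproduction script) or globally; a short sketch, where the global "shuffle" config key is an assumption based on Dask 2021.x behavior:

import dask

# Per-call: matches the commented-out line in the reproduction script above.
ddf = ddf.set_index("id", sorted=False, shuffle="tasks")

# Global alternative (assumes Dask 2021.x consults the "shuffle" config key):
dask.config.set(shuffle="tasks")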

I tested this case with Ray==1.8.0 + Dask==2021.9.1 and with Ray==1.9.0 + Dask==2021.11.0, both on the client and cluster side.

#20108 seems to be related to my case.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@mikwieczorek added the bug (Something that is supposed to be working; but isn't) and triage (Needs triage (eg: priority, bug/not-bug, and owning component)) labels on Dec 9, 2021
@mikwieczorek changed the title from "[Bug] [dask on ray] Setting index when using Ray Cluster causes unstable results and missing p" to "[Bug] [dask on ray] Setting index with Ray Cluster causes unstable results and missing partitions" on Dec 9, 2021
@clarkzinzow added the P1 (Issue that should be fixed within a few weeks) label and removed the triage label on Dec 14, 2021
@clarkzinzow self-assigned this on Dec 14, 2021
@clarkzinzow added this to the Core Backlog milestone on Dec 14, 2021
clarkzinzow (Contributor) commented

Thank you for opening this issue, especially with the reproduction and drilling down into the cases in which it is triggered! This does indeed appear to be an issue with Dask's shuffle algorithm inference: it's defaulting to a disk-based shuffle even though we're using a distributed scheduler, which most likely results in dropped data since the filesystem isn't shared across nodes. Dask Distributed manually sets the shuffle algorithm in the global config to the task-based shuffle, which the Dask-on-Ray scheduler should probably do as well.

I'm going to rescope this issue to having the Dask-on-Ray scheduler set this config automatically.
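
As a rough, purely illustrative sketch of that idea (not necessarily what the eventual change will look like, and assuming the "shuffle" config key that Dask 2021.x consults):

import dask
from ray.util.dask import ray_dask_get

# Illustrative only: mirror what Dask Distributed does by pinning the
# task-based shuffle whenever Dask-on-Ray is made the default scheduler, so
# shuffles in set_index()/groupby never take the node-local disk path.
dask.config.set(scheduler=ray_dask_get, shuffle="tasks")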

@clarkzinzow changed the title from "[Bug] [dask on ray] Setting index with Ray Cluster causes unstable results and missing partitions" to "[Bug] [Dask-on-Ray] Task-based shuffle not being inferred from setting Dask-on-Ray scheduler." on Dec 15, 2021