Repartition Ray dataset if number of shards is too small #283

krfricke · 2023-06-23T15:14:50Z

Currently we throw an error when the number of partitions in a data source is too small for the number of workers.

However, in the case of Ray datasets, we can actually repartition the dataset ourselves.

This will also ensure our quickstart examples, such as in https://docs.ray.io/en/latest/train/train.html#quick-start-to-distributed-training-with-ray-train will work out of the box.

Yard1 · 2023-06-23T17:09:27Z

@krfricke I am a little confused - we should not be throwing any errors as we have moved to a split based solution for Ray Datasets (see get_actor_shards). That's the recommended way instead of repartitioning. How is this error triggered?

krfricke · 2023-06-23T19:14:59Z

The errors comes up when running e.g. the quickstart example from our docs:


import ray
from ray.train.xgboost import XGBoostTrainer
from ray.air.config import ScalingConfig

# Load data.
dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")

# Split data into train and validation.
train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)

trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(
        # Number of workers to use for data parallelism.
        num_workers=2,
        # Whether to use GPU acceleration.
        use_gpu=False,
    ),
    label_column="target",
    num_boost_round=20,
    params={
        # XGBoost specific params
        "objective": "binary:logistic",
        # "tree_method": "gpu_hist",  # uncomment this to use GPU for training
        "eval_metric": ["logloss", "error"],
    },
    datasets={"train": train_dataset, "valid": valid_dataset},
)
result = trainer.fit()

But I see what you mean - we do the split ourselves, so no need to actually call repartition. We can just check if the data source supports automatic repartitioning.

I've updated the PR.

Yard1

Thanks!

Kai Fricke added 2 commits June 23, 2023 16:13

Repartition Ray dataset if number of shards is too small

7967cd4

test

6c4b1cc

keep split

5304c26

Kai Fricke added 2 commits June 23, 2023 20:15

false

d0ef91a

remove test

58a92b0

Yard1 approved these changes Jun 23, 2023

View reviewed changes

krfricke merged commit 8668a77 into ray-project:master Jun 24, 2023

krfricke deleted the repartition branch June 24, 2023 08:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repartition Ray dataset if number of shards is too small #283

Repartition Ray dataset if number of shards is too small #283

krfricke commented Jun 23, 2023

Yard1 commented Jun 23, 2023

krfricke commented Jun 23, 2023

Yard1 left a comment

Repartition Ray dataset if number of shards is too small #283

Repartition Ray dataset if number of shards is too small #283

Conversation

krfricke commented Jun 23, 2023

Yard1 commented Jun 23, 2023

krfricke commented Jun 23, 2023

Yard1 left a comment

Choose a reason for hiding this comment