
Comparison to dask #642

Closed · shoyer opened this issue Jun 6, 2017 · 6 comments

shoyer commented Jun 6, 2017

Ray looks like an interesting project! I see some similarities to dask (http://dask.pydata.org), especially for ad-hoc parallelism in Python. I would be interested to see a more detailed comparison. There might be some opportunities for collaboration or at least inspiration.

CC @mrocklin

robertnishihara (Collaborator) commented:

Thanks! This isn't a detailed comparison, but here are some thoughts that come to mind. And I agree there are lots of opportunities for collaboration/inspiration!

  • Remote function API: The ray.remote decorator resembles dask.delayed, but I think it is closer to Dask's client.submit API (see the sketch after this list).

  • Actors: An important part of the Ray API is the actor abstraction for sharing mutable state between tasks (e.g., the state of a neural network or the state of a simulator); this is also shown in the sketch after this list. It blends very nicely with the side-effect-free dataflow abstraction and is important for our workloads, both to share state and to avoid expensive initializations. I don't think there is an analogue in Dask.

  • Collections: Dask has extensive high-level collection APIs (dataframes, distributed arrays, etc.), whereas Ray does not.

  • Scheduling: Ray uses a distributed bottom-up scheduling scheme in which workers submit tasks to local schedulers, and local schedulers assign tasks to workers. Local schedulers can also forward tasks to global schedulers, which can load-balance between machines. Dask uses a centralized scheduler that manages all tasks for the cluster. The point of the bottom-up approach is to improve task latency and throughput.

  • Data: Ray stores objects in an object store and serializes them using Apache Arrow (there is one object store process per machine). Worker processes can access objects in the object store through shared memory with minimal deserialization (and without copying the data). I don't think Dask has an analogue of the Ray object store.

  • System state/metadata: Ray uses a sharded database (implemented as multiple Redis servers) to store the metadata and control state of the system. This is a pretty central feature of our current design and shapes many of our other design decisions.
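
To make the remote-function and actor bullets concrete, here is a minimal sketch (not from the original discussion; it assumes ray and dask.distributed are installed, and the square function and Counter class are illustrative):

```python
import ray
from dask.distributed import Client

ray.init()          # start a local Ray instance
client = Client()   # start a local Dask cluster

def square(x):
    return x * x

# Ray: wrap the function with ray.remote and invoke it with .remote();
# the call returns an object ref that ray.get resolves to the result.
square_remote = ray.remote(square)
ray_result = ray.get(square_remote.remote(4))

# Dask: submit the plain function to the cluster; the call returns a
# Future whose .result() blocks until the value is available.
dask_result = client.submit(square, 4).result()

assert ray_result == dask_result == 16

# Ray actor: a stateful worker process; method calls run as tasks that
# share the actor's mutable state across invocations.
@ray.remote
class Counter:
    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count

counter = Counter.remote()
print(ray.get(counter.increment()))  # 1
print(ray.get(counter.increment()))  # 2
```

In both systems the submission call returns immediately; the visible difference is how results are retrieved (ray.get on object refs vs. .result() on futures) and Ray's first-class support for stateful actors.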

And presumably lots of other things as well. @mrocklin please correct me if I said something wrong.

@mrocklin
Copy link

mrocklin commented Jun 7, 2017

This description sounds good to me from the Dask perspective. I think the centralized vs. bottom-up/distributed scheduling is probably the central difference.

The question I'm now curious about is "is there anything that Dask should learn and copy from Ray?" I hope that this question comes across more as flattering than as encroaching :) For example, I suspect that we could copy something like the actor model API decently easily. We do something already with long-running clients on workers (here is a simple script that includes an example) but it could be that this isn't the way that people want to think about these sorts of problems.
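
For reference, dask.distributed did later gain an experimental Actors API roughly along these lines. A minimal sketch, assuming a recent distributed release with actor support (the Counter class is illustrative):

```python
from dask.distributed import Client

class Counter:
    # Plain Python class; its mutable state lives on a single worker.
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n

client = Client()                                      # local cluster
counter = client.submit(Counter, actor=True).result()  # actor handle
future = counter.increment()                           # runs where the state lives
print(future.result())                                 # 1
```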

There are likely some core differences that Dask will probably never be able to support, but I'd be very open to seeing if there are opportunities or advantages for collaboration in some settings.

robertnishihara (Collaborator) commented:

The actor API is a good candidate :) Queues give a lot of flexibility, but can significantly complicate fault tolerance.

The work we've been doing on serialization using Apache Arrow could potentially/hopefully be useful for other projects as well.

robertnishihara (Collaborator) commented:

Closing this for now, feel free to reopen or continue the discussion.


colobas commented Mar 5, 2018

Hey, congrats on the great work. Another question I think is relevant to the comparison between Dask and Ray is that of high availability.

Specifically, as far as I know, Dask can only have a single central scheduler, and as of now it isn't possible to have redundant schedulers ready to take over in the case of a scheduler crash. How does Ray compare in this regard?

(@mrocklin, correct me if I'm wrong.)

robertnishihara (Collaborator) commented:

Hi @colobas, scheduling works a bit differently in Ray. Each machine has its own scheduler (responsible for managing the workers on that machine), and failures are handled at the granularity of machines: if the scheduler on a given machine dies, the whole machine is considered to have failed. Objects that were lost but are still needed are recreated by rerunning the tasks that created them. The task specifications are stored in a sharded in-memory database; we are currently not resilient to the failure of that database, but we're prototyping a fault-tolerance scheme for it based on chain replication.

Currently there are some limitations on the kinds of failures we handle, as described in http://ray.readthedocs.io/en/latest/fault-tolerance.html.

cc @stephanie-wang @concretevitamin

jmargeta added a commit to jmargeta/awesome-python that referenced this issue Oct 22, 2018
# What is this Python project?
Ray is a flexible, high-performance distributed execution framework. It achieves parallelism in Python with a simple and consistent API.
Ray is particularly suited for machine learning and forms the base of libraries for deep and reinforcement learning, distributed processing of Pandas dataframes, and hyperparameter search.

# What's the difference between this Python project and similar ones?
 - Similar to Dask; see a comparison here: ray-project/ray#642
 - Allows large numpy arrays (or other objects serializable with Arrow) to be shared efficiently between processes, without copying the data and with only minimal deserialization (see the sketch below)
 - Achieves lower task latency with bottom-up scheduling
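
A minimal sketch of the zero-copy sharing mentioned above (the array size is illustrative; assumes a local Ray instance):

```python
import numpy as np
import ray

ray.init()

# Put a large array into the per-node object store once...
big = np.zeros((10_000, 1_000), dtype=np.float64)   # ~80 MB
ref = ray.put(big)

@ray.remote
def column_sums(arr):
    # ...workers on the same node read it through shared memory as a
    # read-only array, without copying the underlying buffer.
    return arr.sum(axis=0)

print(ray.get(column_sums.remote(ref)).shape)  # (1000,)
```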