
Comparison to dask #642

Closed · shoyer opened this issue Jun 6, 2017 · 6 comments

shoyer commented Jun 6, 2017

Ray looks like an interesting project! I see some similarities to dask (http://dask.pydata.org), especially for ad-hoc parallelism in Python. I would be interested to see a more detailed comparison. There might be some opportunities for collaboration or at least inspiration.

CC @mrocklin

robertnishihara (Collaborator) commented:

Thanks! This isn't a detailed comparison, but here are some thoughts that come to mind. And I agree there are lots of opportunities for collaboration/inspiration!

  • Remote function API: The ray.remote decorator resembles dask.delayed, but I think it is closer to Dask's client.submit API (see the sketch after this list).

  • Actors: An important part of the Ray API is the actor abstraction for sharing mutable state between tasks (e.g., the state of a neural network or the state of a simulator); this is also shown in the sketch after this list. It blends very nicely with the side-effect-free dataflow abstraction and is important for our workloads, both to share state and to avoid expensive initializations. I don't think there is an analogue in Dask.

  • Collections: Dask has extensive high-level collection APIs (dataframes, distributed arrays, etc.), whereas Ray does not.

  • Scheduling: Ray uses a distributed bottom-up scheduling scheme in which workers submit tasks to local schedulers, and local schedulers assign tasks to workers. Local schedulers can also forward tasks to global schedulers, which can load-balance between machines. Dask uses a centralized scheduler that manages all tasks for the cluster. The point of the bottom-up approach is to improve task latency and throughput.

  • Data: Ray stores objects in an object store and serializes them using Apache Arrow (there is one object store process per machine). Worker processes can access objects in the object store through shared memory with minimal deserialization (and without copying the data). I don't think Dask has an analogue of the Ray object store.

  • System state/metadata: Ray uses a sharded database (implemented as multiple Redis servers) to store the metadata and control state of the system. This is a pretty central feature of our current design and shapes many of our other design decisions.
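
To make the remote-function and actor bullets concrete, here is a minimal sketch (not from the original discussion; it assumes ray and dask.distributed are installed, and the square function and Counter class are illustrative):

```python
import ray
from dask.distributed import Client

ray.init()          # start a local Ray instance
client = Client()   # start a local Dask cluster

def square(x):
    return x * x

# Ray: wrap the function with ray.remote and invoke it with .remote();
# the call returns an object ref that ray.get resolves to the result.
square_remote = ray.remote(square)
ray_result = ray.get(square_remote.remote(4))

# Dask: submit the plain function to the cluster; the call returns a
# Future whose .result() blocks until the value is available.
dask_result = client.submit(square, 4).result()

assert ray_result == dask_result == 16

# Ray actor: a stateful worker process; method calls run as tasks that
# share the actor's mutable state across invocations.
@ray.remote
class Counter:
    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count

counter = Counter.remote()
print(ray.get(counter.increment()))  # 1
print(ray.get(counter.increment()))  # 2
```

In both systems the submission call returns immediately; the visible difference is how results are retrieved (ray.get on object refs vs. .result() on futures) and Ray's first-class support for stateful actors.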

And presumably lots of other things as well. @mrocklin please correct me if I said something wrong.

@mrocklin
Copy link

mrocklin commented Jun 7, 2017

This description sounds good to me from the Dask perspective. I think the centralized vs. bottom-up/distributed scheduling is probably the central difference.

The question I'm now curious about is "is there anything that Dask should learn and copy from Ray?" I hope that this question comes across more as flattering than as encroaching :) For example, I suspect that we could copy something like the actor model API decently easily. We do something already with long-running clients on workers (here is a simple script that includes an example) but it could be that this isn't the way that people want to think about these sorts of problems.
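
For reference, dask.distributed did later gain an experimental Actors API roughly along these lines. A minimal sketch, assuming a recent distributed release with actor support (the Counter class is illustrative):

```python
from dask.distributed import Client

class Counter:
    # Plain Python class; its mutable state lives on a single worker.
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n

client = Client()                                      # local cluster
counter = client.submit(Counter, actor=True).result()  # actor handle
future = counter.increment()                           # runs where the state lives
print(future.result())                                 # 1
```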

There are likely some core differences that Dask will probably never be able to support, but I'd be very open to seeing if there are opportunities or advantages for collaboration in some settings.

robertnishihara (Collaborator) commented:

The actor API is a good candidate :) Queues give a lot of flexibility, but can significantly complicate fault tolerance.

The work we've been doing on serialization using Apache Arrow could potentially/hopefully be useful for other projects as well.

robertnishihara (Collaborator) commented:

Closing this for now, feel free to reopen or continue the discussion.


colobas commented Mar 5, 2018

Hey, congrats on the great work. Another question I think is relevant to the comparison between Dask and Ray is that of high availability.

Specifically, as far as I know, Dask can only have a single central scheduler, and as of now it isn't possible to have redundant schedulers ready to take over in the case of a scheduler crash. How does Ray compare in this regard?

(@mrocklin, correct me if I'm wrong.)

robertnishihara (Collaborator) commented:

Hi @colobas, scheduling works a bit differently in Ray. Each machine has its own scheduler (responsible for managing the workers on that machine), and failures are handled at the granularity of machines: if the scheduler on a given machine dies, the whole machine is considered to have failed. Objects that were lost but are still needed are recreated by rerunning the tasks that created them. The task specifications are stored in a sharded in-memory database; we are currently not resilient to the failure of that database, but we're prototyping a fault-tolerance scheme for it based on chain replication.

Currently there are some limitations on the kinds of failures we handle, as described in http://ray.readthedocs.io/en/latest/fault-tolerance.html.

cc @stephanie-wang @concretevitamin

jmargeta added a commit to jmargeta/awesome-python that referenced this issue Oct 22, 2018
# What is this Python project?
Ray is a flexible, high-performance distributed execution framework. It achieves parallelism in Python with a simple and consistent API.
Ray is particularly suited for machine learning and forms the base of libraries for deep and reinforcement learning, distributed processing of Pandas dataframes, and hyperparameter search.

# What's the difference between this Python project and similar ones?
 - Similar to Dask; see a comparison here: ray-project/ray#642
 - Allows large numpy arrays (or other objects serializable with Arrow) to be shared efficiently between processes, without copying the data and with only minimal deserialization (see the sketch below)
 - Achieves lower task latency with bottom-up scheduling
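
A minimal sketch of the zero-copy sharing mentioned above (the array size is illustrative; assumes a local Ray instance):

```python
import numpy as np
import ray

ray.init()

# Put a large array into the per-node object store once...
big = np.zeros((10_000, 1_000), dtype=np.float64)   # ~80 MB
ref = ray.put(big)

@ray.remote
def column_sums(arr):
    # ...workers on the same node read it through shared memory as a
    # read-only array, without copying the underlying buffer.
    return arr.sum(axis=0)

print(ray.get(column_sums.remote(ref)).shape)  # (1000,)
```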