
Performance issues with defining remote functions and actor classes from within tasks. #6240

Open
robertnishihara opened this issue Nov 22, 2019 · 5 comments
Labels
enhancement (Request for new feature and/or capability), P2 (Important issue, but not time-critical)

Comments

@robertnishihara
Collaborator

robertnishihara commented Nov 22, 2019

Consider the following code.

import ray
ray.init(num_cpus=10)

@ray.remote
def f():
    @ray.remote
    def g():
        return 1
    return ray.get(g.remote())

ray.get([f.remote() for _ in range(10)])

If the 10 copies of f are scheduled on 10 different workers, they will all define g. Each copy of g will be pickled and exported through Redis and then imported by each worker process. So there is an N^2 effect here.

Ideally, we would deduplicate the imports. However, there doesn't appear to be an easy way to determine if the g functions that are exported are actually the same or not. If you just look at the body of the function (e.g., with inspect.getsource), then you will think that two functions are the same if they have the same body but close over different variables in the environment. We can compare the serialized strings generated by cloudpickle, but cloudpickle is nondeterministic, so the same function pickled in different processes will often give rise to different strings. Therefore not enough deduplication will happen.
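
To make the pitfall concrete, here is a minimal sketch in plain Python (not Ray code; make_adder is just an illustrative helper) showing two closures whose inspect.getsource output is identical even though they close over different values, so a source-based comparison would wrongly deduplicate them:

import inspect

def make_adder(n):
    def add(x):
        return x + n
    return add

add_one = make_adder(1)
add_two = make_adder(2)

# The source text is identical for both closures ...
assert inspect.getsource(add_one) == inspect.getsource(add_two)
# ... but they capture different values of n and behave differently.
assert add_one(10) == 11 and add_two(10) == 12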

In #6175, we're settling for not doing any deduplication but giving a warning whenever the source returned by inspect.getsource looks the same.

The longer term solution will likely be to remove the N^2 effect, e.g., perhaps by treating remote functions as objects stored in the object store (instead of Redis) or perhaps by having the workers pull the remote functions from Redis when needed (instead of pushing the remote functions proactively to the workers).

Workaround

Modify the above code to define the remote function on the driver instead. E.g.,

import ray
ray.init(num_cpus=10)

@ray.remote
def g():
    return 1

@ray.remote
def f():
    return ray.get(g.remote())

ray.get([f.remote() for _ in range(10)])

You can look at the different values of len(ray.worker.global_worker.redis_client.lrange('Exports', 0, -1)) produced by the two workloads.
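
For example, something like the following can be appended to each of the two scripts (a rough sketch, assuming a Ray version from that era in which the driver's worker exposes redis_client, as in the expression above); the count should be noticeably larger for the nested-definition version:

num_exports = len(ray.worker.global_worker.redis_client.lrange('Exports', 0, -1))
print("Number of exports:", num_exports)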

@markgoodhead
Contributor

FYI I get this issue for Ray Tune's internals which sent me here:

WARNING import_thread.py:126 -- The actor 'WrappedFunc' has been exported 100 times. It's possible that this warning is accidental, but this may indicate that the same remote function is being defined repeatedly from within many tasks and exported to all of the workers. This can be a performance issue and can be resolved by defining the remote function on the driver instead. See #6240 for more discussion.

I'm using remote calls within my trainable function (each Tune task has 3k sub-tasks), but I'm defining it outside of the Trainable function (like in the second example), so I'm not sure why this would still apply?

@amogkam
Contributor

amogkam commented Jul 29, 2020

@markgoodhead WrappedFunc is part of the Tune internals. The warning shouldn't be due to your subtasks.

@rkooo567 rkooo567 self-assigned this Oct 12, 2020
@rkooo567 rkooo567 added the enhancement and P2 labels Oct 12, 2020
@yywangvr

yywangvr commented Mar 17, 2021

> FYI I get this issue for Ray Tune's internals which sent me here:
>
> WARNING import_thread.py:126 -- The actor 'WrappedFunc' has been exported 100 times. It's possible that this warning is accidental, but this may indicate that the same remote function is being defined repeatedly from within many tasks and exported to all of the workers. This can be a performance issue and can be resolved by defining the remote function on the driver instead. See #6240 for more discussion.
>
> I'm using remote calls within my trainable function (each Tune task has 3k sub-tasks), but I'm defining it outside of the Trainable function (like in the second example), so I'm not sure why this would still apply?

I get the same warning; how can I solve it?

@oakkas

oakkas commented Sep 20, 2021

I also have the same issue. I defined two functions, where one remote task calls the other hundreds of times, depending on the number of rows.

@jmakov
Contributor

jmakov commented Sep 23, 2021

Got the same warning with Tune. No Ray calls in my trainable function.
