Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug in which remote function redefinition doesn't happen. #6175

Merged
merged 11 commits into from
Nov 26, 2019

Conversation

robertnishihara
Copy link
Collaborator

@robertnishihara robertnishihara commented Nov 16, 2019

  • This should fix Function definitions are not always updated when redefined. #6130.
  • If a user defines the same remote function many (100) times (e.g., from within 100 separate tasks), then a warning will be printed.
  • We also warn for actors.
  • Spurious warnings are possible (e.g., if the something in the remote functions closure changes so the remote function is not actually the same).
  • This only works for Python 3 because it uses the dis module.
  • The behavior is slightly different for Python 3.7+ (hopefully better), but I didn't actually try it out yet.
  • It's possible that dis.dis is sufficiently nondeterministic that we don't always get the warning when we should. However, I haven't noticed this happening.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@robertnishihara robertnishihara changed the title [WIP] Fix bug in which remote function redefinition doesn't happen. Fix bug in which remote function redefinition doesn't happen. Nov 16, 2019
python/ray/function_manager.py Outdated Show resolved Hide resolved
@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18530/
Test FAILed.

@robertnishihara
Copy link
Collaborator Author

@edoakes I just tried the following on 3840 cores.

@ray.remote                                                    
def f():
    @ray.remote
    def g():
        return 1
    return ray.get(g.remote())

%time results = ray.get([g.remote() for _ in range(3840 // 2)])

It took 1 minute, which of course is absurdly slow because of the N^2 effect.

One possible way to help (though certainly not the perfect fix) is to detect the scenario where the N^2 effect is happening (e.g., by detecting that many of the remote functions defined have the same source code) and give a good warning to the user (e.g., tell them to pull the remote function definition back to the driver instead of having it defined in the workers.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18526/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18528/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18536/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18537/
Test PASSed.

Copy link
Contributor

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed this looks good, let's just be sure to give a good warning when there are many redefinitions and fix it by pulling from workers in the near future.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18819/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18871/
Test FAILed.

# Source code may not be available:
# e.g. Cython or Python interpreter.
function_source_hash = b""
hasher = hashlib.sha1()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just do hashlib.sha1(pickled_function).digest()

"""The identifier is used to detect excessive duplicate exports.

The identifier is used to determine when the same function or class is
exported many times. This can yielf false positives.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo

collision_identifier = function_or_class.__name__

# Return a hash of the identifier in case it is too large.
hasher = hashlib.sha1()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

@@ -30,13 +32,19 @@ class ImportThread(object):
redis_client: the redis client used to query exports.
threads_stopped (threading.Event): A threading event used to signal to
the thread that it should exit.
imported_collision_identifiers: This is a dicitonary mapping strings
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
imported_collision_identifiers: This is a dicitonary mapping strings
imported_collision_identifiers: This is a dictionary mapping strings

@@ -30,13 +32,19 @@ class ImportThread(object):
redis_client: the redis client used to query exports.
threads_stopped (threading.Event): A threading event used to signal to
the thread that it should exit.
imported_collision_identifiers: This is a dicitonary mapping strings
containing the collision identifier of the remote functions that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strings containing the collision identifier of the remote functions that have been exported -> collision identifiers for exported remote functions

imported_collision_identifiers: This is a dicitonary mapping strings
containing the collision identifier of the remote functions that
have been exported to the number of times each remote function has
been imported. this is used to provide good error messages when the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
been imported. this is used to provide good error messages when the
been imported. This is used to provide good error messages when the

containing the collision identifier of the remote functions that
have been exported to the number of times each remote function has
been imported. this is used to provide good error messages when the
same function is exported many many times.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
same function is exported many many times.
same function is exported many times.

if key.startswith(b"RemoteFunction"):
collision_identifier, function_name = (
self.redis_client.hmget(
key, ["collision_identifier", "name"]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
key, ["collision_identifier", "name"]))
key, ["collision_identifier", "function_name"]))

This would be more clear and matches class_name used for actors. If this requires making many changes, don't worry about it.

"https://github.com/ray-project/ray/issues/6240 for more "
"discussion.")

if key.startswith(b"RemoteFunction"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we abstract this whole if/else block into a helper function that returns a collision identifier, type, and name? Better to separate out the redis implementation details from the actual logic.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18891/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18893/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18914/
Test FAILed.

@edoakes edoakes merged commit ffb9c0e into ray-project:master Nov 26, 2019
@robertnishihara robertnishihara deleted the fixredefinition branch November 26, 2019 18:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Function definitions are not always updated when redefined.
3 participants