FunctionTask hashes depending on imports #457
Comments
@JW-96 - this is both a bug and not a bug, but the full solution is complicated and would need some work. The way function hashing works right now is to cloudpickle the function and hash the pickled state, so the hash depends on your exact environment and the pickling process, and will not survive if packages are updated or the function is switched. The nice thing, though, is that the pickled function will execute in a similar environment. We were contemplating a more source-based operation similar to what nipype uses; however, this would require significant function introspection, and we were considering the dill package to do so. For this part we would hash the source code of the function up to some prescribed depth and recreate the function on a rerun. This would involve detecting functions and dependencies, and somehow checking which to reconstruct and which can be relied on from libraries.
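To make the mechanism concrete, here is a minimal sketch of hashing-by-serialization. Pydra actually uses cloudpickle (which can serialize closures, lambdas, and resolved globals, which is exactly why the hash is tied to the environment); `naive_function_hash` is a hypothetical, simplified stand-in built on the standard library only:

```python
import hashlib
import pickle

def naive_function_hash(func):
    # Hypothetical simplification: hash only the compiled bytecode and
    # the names the code references. cloudpickle serializes much more
    # state (defaults, closures, the resolved __globals__ entries), so
    # its digest additionally depends on the surrounding environment.
    state = (func.__code__.co_code, func.__code__.co_names)
    return hashlib.sha256(pickle.dumps(state)).hexdigest()

def double(x):
    return x * 2

# Stable across calls within one interpreter, since the hashed state
# here is deterministic.
print(naive_function_hash(double))
```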
@satra - I wonder if we should be returning an error if imports are done outside the function.
The hashes are inconsistent not only when imported functions are used, but also when other global variables are used inside the function.
I think this is more of a problem with Python itself; it should enforce some order in `__globals__`.
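The ordering problem can be shown with plain dicts and the standard-library `pickle` (a sketch of the mechanism, not pydra's actual code): two mappings with the same items but different insertion order serialize to different byte streams, so any hash of the pickled state differs.

```python
import hashlib
import pickle

def function_hash(obj):
    """Hash an object's pickled byte stream (the mechanism described above)."""
    return hashlib.sha256(pickle.dumps(obj)).hexdigest()

# Equal as mappings, but pickled in insertion order -- the same way an
# unordered set of __globals__ references leaks ordering into the hash.
a = {"x": 1, "y": 2}
b = {"y": 2, "x": 1}
print(a == b)                                # True
print(function_hash(a) == function_hash(b))  # False: byte streams differ
```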
We cannot raise errors, since a key goal is to let users reuse functions defined in any library; we are not going to change the Python ecosystem to put imports into functions. However, we can do the technical task of introspecting functions to do the kind of state preservation we need.
I might be missing something, but would submitting a patch to sort the globals work? I'd think we could patch here:

```diff
  f_globals_ref = _extract_code_globals(func.__code__)
- f_globals = {k: func.__globals__[k] for k in f_globals_ref if k in func.__globals__}
+ f_globals = {k: func.__globals__[k] for k in sorted(f_globals_ref) if k in func.__globals__}
```
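The effect of that one-line change can be sketched in isolation. The helpers below are illustrative stand-ins (not the real cloudpickle internals): the unsorted version leaks the order of the extracted references into the pickled mapping, while the sorted version is deterministic.

```python
import hashlib
import pickle

def hash_globals(globals_ref, func_globals):
    # Unsorted, mirroring the unpatched line: the iteration order of
    # globals_ref decides the dict's insertion order, hence the bytes.
    d = {k: func_globals[k] for k in globals_ref if k in func_globals}
    return hashlib.sha256(pickle.dumps(d)).hexdigest()

def hash_globals_sorted(globals_ref, func_globals):
    # With sorted(), the mapping is built in a canonical order.
    d = {k: func_globals[k] for k in sorted(globals_ref) if k in func_globals}
    return hashlib.sha256(pickle.dumps(d)).hexdigest()

env = {"gzip": "module gzip", "shutil": "module shutil"}
print(hash_globals(["gzip", "shutil"], env) ==
      hash_globals(["shutil", "gzip"], env))           # False
print(hash_globals_sorted(["gzip", "shutil"], env) ==
      hash_globals_sorted(["shutil", "gzip"], env))    # True
```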
Yes, it would solve the current issue in this thread, and we should do that. But this will not address the general problem of hashing functions across environments. Also, at this point cloudpickle explicitly states in its README that the pickle should not be seen as consistent. So having some other options in the pipe would be good.
I see I didn't read carefully enough. You're suggesting to replace cloudpickle with our own hand-rolled version?
Eventually; for now your solution would cover this use case. I wouldn't have a problem doing this in cloudpickle itself, but they may consider it out of scope. We should set up a chat with Olivier.
Took a while to figure out how to write a test, but: cloudpipe/cloudpickle#418. |
Let me take that back - this won't take care of globals from different modules, I think.

BTW, this cloudpickle PR may also help: cloudpipe/cloudpickle#417
cloudpipe/cloudpickle#428 is in, so any cloudpickle >=1.6.1 should not have the globals non-determinism. That PR does use insertion order, rather than sorting, so it's conceivable that non-determinism could work its way back in, but the tests indicate that (for now) it's resilient to the initial random state. |
Nice! Any ideas on the release schedule?
Nope. It's been a bit. It looks like they're kind of active at the moment, so hopefully this burst will be followed by a release. |
Released in 2.0. Upgrading cloudpickle dependency will close this. |
What version of Pydra are you using?
Pydra version: 0.14.1
Python version: 3.8.5
This is a follow up to #455.
I found another way to create inconsistent hash values for a FunctionTask. I came across it after having used the workaround for the first bug in my project. A decorated function that uses several functions imported at the beginning of the module had inconsistent hash values too.
I could replicate the behaviour with the following minimal example.
I observed the hash randomly alternating between two values when rerunning the above programs.
The issue does not occur if the two imports happen inside the gunzip_alt function:
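A sketch of that workaround is shown below. The issue's original minimal example is not reproduced here, and this body is only an assumption about what a gunzip helper might look like; the point is structural: with the imports inside the function, `func.__globals__` carries no module-level references whose ordering could leak into the hash.

```python
def gunzip_alt(in_file, out_file):
    # Imports live inside the function body, so gzip and shutil never
    # appear in this function's __globals__ and cannot perturb its hash.
    import gzip
    import shutil

    with gzip.open(in_file, "rb") as f_in, open(out_file, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return out_file
```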
I also did not observe the issue if only one imported function was used. If three imported functions were used inside the gunzip_alt function, then I observed it alternating between six hashes.
When comparing the printed inputs of task4, I only noticed changes at the end of the functions' byte representation (_func argument) after the __globals__ keyword. I suspect that Python randomly chooses the order in which it references these imported functions inside __globals__; hence the numbers 1, 2, and 6 from the factorial sequence.
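The factorial pattern can be checked directly with a small standard-library sketch (illustrative, not pydra's hashing code): n module-level references can land in `__globals__` in any of n! insertion orders, and each order pickles to different bytes.

```python
import hashlib
import itertools
import math
import pickle

def order_sensitive_hash(keys):
    # Build a mapping in the given insertion order and hash its pickle,
    # mimicking how __globals__ ordering leaks into a function's hash.
    return hashlib.sha256(pickle.dumps({k: None for k in keys})).hexdigest()

for n in (1, 2, 3):
    names = [f"func{i}" for i in range(n)]
    hashes = {order_sensitive_hash(p) for p in itertools.permutations(names)}
    # Distinct hash per insertion order: 1, 2, 6 -- the factorial sequence.
    print(n, len(hashes), math.factorial(n))
```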
I am not sure if this is a bug, or whether users should be required to import all used functions inside the decorated function.