[Auto-Schedule][Fix] Fix hang while tune model through rpc #9032
Conversation
@shingjan please take a look at this PR. After #8492 the tuning doesn't work properly: it hangs. The problem is in resource management and the global dictionary with arguments. The global arguments don't release their managed resources, and the RPC session hangs until they are released.
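For illustration, here is a hypothetical, minimal sketch of the kind of retention being described. The names (`GLOBAL_ARGS`, `measure`) are invented for this sketch, not TVM internals:

```python
# Hypothetical illustration of the retention problem (names invented):
# a module-level dict filled with device NDArrays inside the measurement
# function keeps those buffers referenced after the function returns.
import tvm
from tvm.runtime import ndarray

GLOBAL_ARGS = {0: None}  # argument slots, populated lazily per run

def measure(dev):
    for idx in GLOBAL_ARGS:
        if GLOBAL_ARGS[idx] is None:
            GLOBAL_ARGS[idx] = ndarray.empty((1024,), "float32", dev)
    # ... run and time the kernel here ...
    # On return, GLOBAL_ARGS still references the device arrays, so the
    # buffers (and the session that owns them) are never released.

measure(tvm.cpu(0))
```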
Yes, the hang is related to the global dictionary. In this case I have another workaround for fixing the hang, in the `assert len(args) == len(build_res.args)` part:

```diff
  assert len(args) == len(build_res.args)
+ import copy
+ loc_args = copy.deepcopy(args)
  # pylint: disable=consider-using-enumerate
- for idx in range(len(args)):
-     if args[idx] is None:
+ for idx in range(len(loc_args)):
+     if loc_args[idx] is None:
          build_res_arg = build_res.args[idx]
          empty_array = ndarray.empty(
              get_const_tuple(build_res_arg.shape), build_res_arg.dtype, dev
          )
          random_fill(empty_array)
-         args[idx] = empty_array
+         loc_args[idx] = empty_array
      else:
-         args[idx] = ndarray.array(args[idx], dev)
+         loc_args[idx] = ndarray.array(args[idx], dev)
```

With this I don't need the previous workaround, since it is only necessary to do a deep copy of the arguments. What do you think about it?
Sorry, I'm a bit confused. Is it because …
I think so. In the case when the …
I tried to find the exact object which holds the resources and leads to the hang, but didn't find it. Thinking about it now, I suppose it can be the empty ndarrays which were created, but I have to check this. Maybe a good solution would be to regenerate, for each run, the arguments which were not predefined.
I think this is already the case. Currently, on the main process side, we look up the dictionary via …
Who will guarantee that the arrays will be destroyed? We have a global variable which was passed to the function, and in this function new arrays were created in this global dictionary. When the program is run on the device, these arrays won't be destroyed, because they are part of the global object. Or am I wrong?

I updated this PR with another fix for this problem: I just create a local copy of the arguments in … Another workable fix is to set the generated arguments back to `None` after the measurement:

```diff
  assert len(args) == len(build_res.args)
+ indices = []
  # pylint: disable=consider-using-enumerate
  for idx in range(len(args)):
      if args[idx] is None:
          build_res_arg = build_res.args[idx]
          empty_array = ndarray.empty(
              get_const_tuple(build_res_arg.shape), build_res_arg.dtype, dev
          )
          random_fill(empty_array)
+         indices.append(idx)
          args[idx] = empty_array
      else:
          args[idx] = ndarray.array(args[idx], dev)
  dev.sync()
  # First run for check that the kernel is correct
  func.entry_func(*args)
  dev.sync()
  costs = time_f(*args).results
+ for idx in indices:
+     args[idx] = None
```

In this case, the arrays will also be destroyed. As for the steps to reproduce this issue: …
I see. If I understand correctly, the issue is when the popen worker got a copy of …
Yes, I did it and it works fine. I wrote this code with indices just because I wasn't sure that I can set …
Alternatively, we can try creating a local array without a deep copy. I think the issue is that the arguments sent to Popen workers are not freed immediately (not sure why); if so, creating another array should work.
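For illustration, a minimal sketch of what this "local array without deep copy" idea could look like, written as a diff against the snippet quoted earlier (same assumed context; `loc_args` is just an illustrative name carried over from the deep-copy diff above):

```diff
  assert len(args) == len(build_res.args)
+ loc_args = []  # local list: entries of the shared `args` are never replaced
  # pylint: disable=consider-using-enumerate
  for idx in range(len(args)):
      if args[idx] is None:
          build_res_arg = build_res.args[idx]
          empty_array = ndarray.empty(
              get_const_tuple(build_res_arg.shape), build_res_arg.dtype, dev
          )
          random_fill(empty_array)
-         args[idx] = empty_array
+         loc_args.append(empty_array)
      else:
-         args[idx] = ndarray.array(args[idx], dev)
+         loc_args.append(ndarray.array(args[idx], dev))
  dev.sync()
  # First run for check that the kernel is correct
- func.entry_func(*args)
+ func.entry_func(*loc_args)
  dev.sync()
- costs = time_f(*args).results
+ costs = time_f(*loc_args).results
```

Since `loc_args` is a local, it goes out of scope when the measurement function returns, so the device arrays can be garbage-collected without mutating the shared `args`.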
@vinx13 sorry for the delay, I was busy over the weekend. I updated the PR and added the fix with local arguments. Could you please take a look once again?
LGTM. Probably need to re-trigger CI. BTW, is this an issue that PopenPoolExecutor will always have with parameter/argument passing?
@shingjan not sure why args sent to the popen worker are not immediately GC'ed.
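One hypothetical way this can happen (illustrative only, not TVM code): a persistent worker process reuses a single interpreter across tasks, so anything reachable from the worker's module globals outlives each individual call:

```python
# Hypothetical sketch (names invented): module-level state in a
# long-lived worker process survives between tasks, so arguments
# stashed there are not garbage-collected when one task finishes.
_worker_cache = {}

def run_task(task_id, payload):
    _worker_cache[task_id] = payload  # reference outlives this call
    return len(payload)
```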
Found a minor change that is needed.
Sorry, I only now saw that you added the comment in this PR. Thank you for the change.
* [Auto-Schedule][Fix] Fix hang while tune model through rpc
* Fix problem with hang by using deep copy
* Fix with local args
* Update python/tvm/auto_scheduler/measure.py

Co-authored-by: Wuwei Lin <[email protected]>