
evaluation getting stuck when using --im_eval with num_envs>1 #17

Closed
kinalmehta opened this issue Dec 14, 2023 · 9 comments

Comments

@kinalmehta

When evaluating on AMASS with the following command, the code gets stuck at the tqdm progress bar:

python phc/run.py --task HumanoidImMCPGetup --cfg_env phc/data/cfg/phc_shape_mcp_iccv.yaml --cfg_train phc/data/cfg/train/rlg/im_mcp.yaml --motion_file sample_data/amass_isaac_eval.pkl --network_path output/phc_shape_mcp_iccv --test --num_envs 100 --epoch -1 --no_virtual_display --im_eval

However, it works completely fine when running with --num_envs 1.
Library versions in use:

python                    3.8
torch                     2.1.1
torchaudio                2.1.1
torchgeometry             0.1.2
torchmetrics              1.2.0
torchvision               0.16.1
tqdm                      4.66.1
@ZhengyiLuo
Owner

Looks like a multi-processing error; those can be relatively finicky. It could be either the data loader part or the robot creation part.

Try setting num_jobs = 1 at this line in motion_lib_base, or here in humanoid.
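
For reference, a minimal, hypothetical sketch of the pattern being suggested (the actual loading code in motion_lib_base and humanoid differs); with num_jobs = 1 the loader never creates worker processes, so nothing can deadlock:

import torch.multiprocessing as mp

def load_motions(motion_files, load_fn, num_jobs=1):
    # Hypothetical loader, not the repo's code: only spawn workers when num_jobs > 1.
    if num_jobs <= 1:
        # Sequential path: no multiprocessing involved, so no hang is possible here.
        return [load_fn(f) for f in motion_files]
    with mp.Pool(processes=num_jobs) as pool:
        # Parallel path: this is where the evaluation can get stuck.
        return pool.map(load_fn, motion_files)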

@kinalmehta
Author

Yes, I did some debugging and it's exactly as you say: it gets stuck at the data loading part.

Is there a specific reason for this issue? Any pointers I can refer to for solving it?

@ZhengyiLuo
Owner

Try setting num_jobs = 1 at this line in motion_lib_base, or here in humanoid.

Basically, disable multi-processing. How many cores does your machine have?

@kinalmehta
Author

kinalmehta commented Dec 16, 2023

It only works after setting num_jobs=1 in both of the places you mentioned; leaving either one unchanged still causes the issue.

I tried this on 2 systems:

  1. Ubuntu 20.04 on a 48-core system
  2. Fedora 39 on a 16-core system

Edit:

Another thing I noticed is that humanoid uses Python's native multiprocessing while motion_lib_base uses torch.multiprocessing.

Could this issue be caused by combining these two?

@noahcao

noahcao commented Dec 31, 2023

@kinalmehta Not very likely. In my experience, torch.multiprocessing is a wrapper around the Python multiprocessing library that adds some customized functions and APIs, and mixing the two typically does not cause issues.

Can you work around the issue by replacing multiprocessing with torch.multiprocessing?
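
For clarity, a hedged sketch of the swap being suggested; torch.multiprocessing re-exports the standard library's API, so the import is the only line that has to change:

# import multiprocessing                           # original import
import torch.multiprocessing as multiprocessing    # drop-in replacement with torch-aware tensor sharing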

@kinalmehta
Author

Hi @noahcao,
Thanks for the suggestion. I tried that, but the problem still persists.
I'm unable to find a solution to this. The torch.multiprocessing docs mention that the Python implementation is deadlock-free whereas the torch version can run into deadlocks, but no solution is given.

@ZhengyiLuo
Owner

For the data loading part, try uncommenting this line:

mp.set_sharing_strategy('file_system')

which should fix the issue. Though using file_system has caused me problems before as well...
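
For reference, a minimal sketch of that call in isolation (the surrounding code is an assumption, not the repo's file); the strategy must be set before any workers are spawned:

import torch.multiprocessing as mp

# 'file_system' shares tensors between processes via files in shared memory instead
# of file descriptors; it avoids "too many open files" errors but can leave stale
# shared-memory files behind if a worker crashes.
mp.set_sharing_strategy('file_system')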

@ZhengyiLuo
Owner

Does export OMP_NUM_THREADS=1 solve this issue on your end?
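
For completeness, a hedged sketch of the same setting applied from Python rather than the shell; the variable has to be set before OpenMP-backed libraries (numpy, torch) are imported:

import os
os.environ.setdefault('OMP_NUM_THREADS', '1')  # export OMP_NUM_THREADS=1 in the shell is equivalent

import torch  # imported after the env var so OpenMP picks up the limit
torch.set_num_threads(1)  # same effect for torch's intra-op thread pool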

@kinalmehta
Copy link
Author

yes!! This solved the issue.

Thanks a lot. :D
