
evaluation getting stuck when using --im_eval with num_envs>1 #17

Closed
kinalmehta opened this issue Dec 14, 2023 · 9 comments

Comments

@kinalmehta

When evaluating on AMASS with the following command, the code gets stuck at the tqdm progress bar:

python phc/run.py --task HumanoidImMCPGetup --cfg_env phc/data/cfg/phc_shape_mcp_iccv.yaml --cfg_train phc/data/cfg/train/rlg/im_mcp.yaml --motion_file sample_data/amass_isaac_eval.pkl --network_path output/phc_shape_mcp_iccv --test --num_envs 100 --epoch -1 --no_virtual_display --im_eval

However, it works completely fine when running with --num_envs 1.
Library versions in use:

python                    3.8
torch                     2.1.1
torchaudio                2.1.1
torchgeometry             0.1.2
torchmetrics              1.2.0
torchvision               0.16.1
tqdm                      4.66.1
@ZhengyiLuo
Owner

Looks like a multi-processing error; those can be relatively finicky. It could be either the data loader part or the robot creation part.

Try setting num_jobs = 1 at this line in motion_lib_base, or here in humanoid.
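
For reference, a minimal, hypothetical sketch of the pattern being suggested (the actual loading code in motion_lib_base and humanoid differs); with num_jobs = 1 the loader never creates worker processes, so nothing can deadlock:

import torch.multiprocessing as mp

def load_motions(motion_files, load_fn, num_jobs=1):
    # Hypothetical loader, not the repo's code: only spawn workers when num_jobs > 1.
    if num_jobs <= 1:
        # Sequential path: no multiprocessing involved, so no hang is possible here.
        return [load_fn(f) for f in motion_files]
    with mp.Pool(processes=num_jobs) as pool:
        # Parallel path: this is where the evaluation can get stuck.
        return pool.map(load_fn, motion_files)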

@kinalmehta
Author

Yes, I did some debugging and it's exactly as you say: it gets stuck at the data loading part.

Is there a specific reason for this issue? Any pointers I can refer to for solving it?

@ZhengyiLuo
Owner

Try setting num_jobs = 1 at this line in motion_lib_base, or here in humanoid.

Basically, disable multi-processing. How many cores does your machine have?

@kinalmehta
Author

kinalmehta commented Dec 16, 2023

It only works after setting num_jobs=1 in both of the places you mentioned; leaving either one unchanged still causes the issue.

I tried this on 2 systems:

  1. Ubuntu 20.04 on a 48-core system
  2. Fedora 39 on a 16-core system

Edit:

Another thing I noticed is that humanoid uses Python's native multiprocessing while motion_lib_base uses torch.multiprocessing.

Could this issue be caused by combining these two?

@noahcao

noahcao commented Dec 31, 2023

@kinalmehta Not very likely. In my experience, torch.multiprocessing is a wrapper around the Python multiprocessing library that adds some customized functions and APIs, and mixing the two typically does not cause issues.

Can you work around the issue by replacing multiprocessing with torch.multiprocessing?
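
For clarity, a hedged sketch of the swap being suggested; torch.multiprocessing re-exports the standard library's API, so the import is the only line that has to change:

# import multiprocessing                           # original import
import torch.multiprocessing as multiprocessing    # drop-in replacement with torch-aware tensor sharing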

@kinalmehta
Author

Hi @noahcao,
Thanks for the suggestion. I tried that, but the problem still persists.
I'm unable to find a solution to this. The torch.multiprocessing docs mention that the Python implementation is deadlock-free whereas the torch version can run into deadlocks, but no solution is given.

@ZhengyiLuo
Owner

For the data loading part, try uncommenting this line:

mp.set_sharing_strategy('file_system')

which should fix the issue. Though using file_system has caused me problems before as well...
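
For reference, a minimal sketch of that call in isolation (the surrounding code is an assumption, not the repo's file); the strategy must be set before any workers are spawned:

import torch.multiprocessing as mp

# 'file_system' shares tensors between processes via files in shared memory instead
# of file descriptors; it avoids "too many open files" errors but can leave stale
# shared-memory files behind if a worker crashes.
mp.set_sharing_strategy('file_system')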

@ZhengyiLuo
Owner

Does export OMP_NUM_THREADS=1 solve this issue on your end?
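
For completeness, a hedged sketch of the same setting applied from Python rather than the shell; the variable has to be set before OpenMP-backed libraries (numpy, torch) are imported:

import os
os.environ.setdefault('OMP_NUM_THREADS', '1')  # export OMP_NUM_THREADS=1 in the shell is equivalent

import torch  # imported after the env var so OpenMP picks up the limit
torch.set_num_threads(1)  # same effect for torch's intra-op thread pool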

@kinalmehta
Copy link
Author

yes!! This solved the issue.

Thanks a lot. :D
