Pickle Loading Problems (EOFError: Ran out of input) #13

Open
golddohyun opened this issue Feb 28, 2023 · 0 comments

I am trying to train the model on the Vimeo90K dataset, but I get "EOFError: Ran out of input". I was able to train the flow estimator successfully; the error only occurs when training the whole framework, at the point where the flownet weights are loaded (see the traceback). I ran the model on one A6000 GPU with the default num_workers set to 2. Any ideas?

  File "/data/projects/chaeyun/VFIformer/models/archs/VFIformer_arch.py", line 346, in __init__
    self.load_networks('flownet', args.resume_flownet)
  File "/data/projects/chaeyun/VFIformer/models/archs/VFIformer_arch.py", line 354, in load_networks
    load_net = torch.load(load_path, map_location=torch.device(self.device))
  File "/home/chaeyun/.conda/envs/vfiformer/lib/python3.9/site-packages/torch/serialization.py", line 713, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/chaeyun/.conda/envs/vfiformer/lib/python3.9/site-packages/torch/serialization.py", line 920, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 199954) of binary: /home/chaeyun/.conda/envs/vfiformer/bin/python
Traceback (most recent call last):
  File "/home/chaeyun/.conda/envs/vfiformer/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/chaeyun/.conda/envs/vfiformer/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/chaeyun/.conda/envs/vfiformer/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/chaeyun/.conda/envs/vfiformer/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/chaeyun/.conda/envs/vfiformer/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/chaeyun/.conda/envs/vfiformer/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/chaeyun/.conda/envs/vfiformer/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/chaeyun/.conda/envs/vfiformer/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 

Here's my launch command with arguments:

python -m torch.distributed.launch --nproc_per_node=1 --master_port=4178 train.py --launcher pytorch --gpu_ids 0 --loss_l1 --loss_ter --loss_flow --use_tb_logger --batch_size 128 --net_name VFIformer --name train_VFIformer --max_iter 300 --crop_size 192 --save_epoch_freq 5 --resume_flownet ./weights/train_IFNet/snapshot/net_final.pth
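
A quick sanity check, sketched under the assumption that the file passed to --resume_flownet may be empty or truncated (for example, if saving was interrupted). Python's pickle raises exactly this "Ran out of input" message when asked to load a zero-byte file, and the traceback shows the failure in the initial pickle load inside torch.load. The helper name `checkpoint_status` is hypothetical and not part of VFIformer; the snippet uses only the standard library and reproduces the message with a temporary empty file.

```python
import os
import pickle
import tempfile


def checkpoint_status(path):
    """Classify a checkpoint file before handing it to torch.load.

    Hypothetical helper for illustration; not part of the VFIformer code base.
    """
    if not os.path.isfile(path):
        return "missing"
    if os.path.getsize(path) == 0:
        return "empty"
    return "non-empty"


# Create a zero-byte stand-in for a checkpoint that was never fully written.
with tempfile.NamedTemporaryFile(suffix=".pth", delete=False) as tmp:
    empty_path = tmp.name

empty_status = checkpoint_status(empty_path)

# A plain pickle load of the empty file fails the same way as in the traceback.
try:
    with open(empty_path, "rb") as f:
        pickle.load(f)
    error_message = None
except EOFError as exc:
    error_message = str(exc)

os.remove(empty_path)
print(empty_status, "->", error_message)
```

Running `checkpoint_status("./weights/train_IFNet/snapshot/net_final.pth")` from the directory where the training command is launched would show whether the file exists and has any content. A "non-empty" result does not rule out a partially written file, so comparing its size against another snapshot from the same run may also help.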
