What happened + What you expected to happen
When using a HorovodTrainer with Torch on GPU, saving checkpoints built from the model's state dict leads to deserialization errors on the main trainer worker (which does not hold any GPU resources):
(HorovodTrainer pid=16103) File "/home/ray/anaconda3/lib/python3.7/site-packages/torch/serialization.py", line 151, in _cuda_deserialize
(HorovodTrainer pid=16103) device = validate_cuda_device(location)
(HorovodTrainer pid=16103) File "/home/ray/anaconda3/lib/python3.7/site-packages/torch/serialization.py", line 135, in validate_cuda_device
(HorovodTrainer pid=16103) raise RuntimeError('Attempting to deserialize object on a CUDA '
(HorovodTrainer pid=16103) RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
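The failure is not specific to Ray's checkpointing layer: tensors saved from a CUDA device keep their device tag through serialization, and deserializing them in a process where torch.cuda.is_available() is False hits _cuda_deserialize. A minimal sketch of the underlying torch behavior (assuming the save side runs on a CUDA-capable worker; map_location is the remedy torch itself suggests):

import io

import torch
from torch import nn

# Assumes a GPU is available where this runs (as on a Horovod GPU worker).
net = nn.Linear(8, 16).to("cuda")

buf = io.BytesIO()
torch.save(net.state_dict(), buf)  # storages stay tagged as CUDA
buf.seek(0)

# On a CPU-only process (like the main trainer worker), a plain
# torch.load(buf) raises the RuntimeError shown above; remapping the
# storages to CPU on load avoids it.
state = torch.load(buf, map_location="cpu")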
Versions / Dependencies
Latest master
Reproduction script
import torch
from torch import nn

from ray.air import Checkpoint, session, ScalingConfig
from ray.train.horovod import HorovodTrainer


def train_loop(config):
    net = nn.Linear(in_features=8, out_features=16)
    net.to("cuda")
    checkpoint = Checkpoint.from_dict({"model": net.state_dict()})
    session.report({"metric": 1}, checkpoint=checkpoint)


trainer = HorovodTrainer(
    train_loop_per_worker=train_loop,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)
trainer.fit()
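A possible workaround (a sketch only, not a confirmed fix) is to move the state dict tensors to CPU inside the training loop before building the checkpoint, so that the trainer process, which holds no GPU, can unpickle them. Reusing the imports from the script above:

def train_loop(config):
    net = nn.Linear(in_features=8, out_features=16)
    net.to("cuda")
    # Move the checkpointed tensors back to CPU before pickling so the
    # CPU-only trainer process can deserialize them.
    cpu_state = {k: v.cpu() for k, v in net.state_dict().items()}
    checkpoint = Checkpoint.from_dict({"model": cpu_state})
    session.report({"metric": 1}, checkpoint=checkpoint)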
Issue Severity
High: It blocks me from completing my task.
krfricke added the bug and P1 labels and removed the triage label on Sep 12, 2022