What happened + What you expected to happen
When using a HorovodTrainer with Torch on GPU, saving checkpoints built from the model's state dict leads to deserialization errors on the main trainer worker (which does not hold any GPU resources):
(HorovodTrainer pid=16103) File "/home/ray/anaconda3/lib/python3.7/site-packages/torch/serialization.py", line 151, in _cuda_deserialize
(HorovodTrainer pid=16103) device = validate_cuda_device(location)
(HorovodTrainer pid=16103) File "/home/ray/anaconda3/lib/python3.7/site-packages/torch/serialization.py", line 135, in validate_cuda_device
(HorovodTrainer pid=16103) raise RuntimeError('Attempting to deserialize object on a CUDA '
(HorovodTrainer pid=16103) RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
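The failure is not specific to Ray's checkpointing layer: tensors saved from a CUDA device keep their device tag through serialization, and deserializing them in a process where torch.cuda.is_available() is False hits _cuda_deserialize. A minimal sketch of the underlying torch behavior (assuming the save side runs on a CUDA-capable worker; map_location is the remedy torch itself suggests):

import io

import torch
from torch import nn

# Assumes a GPU is available where this runs (as on a Horovod GPU worker).
net = nn.Linear(8, 16).to("cuda")

buf = io.BytesIO()
torch.save(net.state_dict(), buf)  # storages stay tagged as CUDA
buf.seek(0)

# On a CPU-only process (like the main trainer worker), a plain
# torch.load(buf) raises the RuntimeError shown above; remapping the
# storages to CPU on load avoids it.
state = torch.load(buf, map_location="cpu")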
Versions / Dependencies
Latest master
Reproduction script
import torch
from torch import nn

from ray.air import Checkpoint, session, ScalingConfig
from ray.train.horovod import HorovodTrainer


def train_loop(config):
    net = nn.Linear(in_features=8, out_features=16)
    net.to("cuda")
    checkpoint = Checkpoint.from_dict({"model": net.state_dict()})
    session.report({"metric": 1}, checkpoint=checkpoint)


trainer = HorovodTrainer(
    train_loop_per_worker=train_loop,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)
trainer.fit()
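A possible workaround (a sketch only, not a confirmed fix) is to move the state dict tensors to CPU inside the training loop before building the checkpoint, so that the trainer process, which holds no GPU, can unpickle them. Reusing the imports from the script above:

def train_loop(config):
    net = nn.Linear(in_features=8, out_features=16)
    net.to("cuda")
    # Move the checkpointed tensors back to CPU before pickling so the
    # CPU-only trainer process can deserialize them.
    cpu_state = {k: v.cpu() for k, v in net.state_dict().items()}
    checkpoint = Checkpoint.from_dict({"model": cpu_state})
    session.report({"metric": 1}, checkpoint=checkpoint)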
Issue Severity
High: It blocks me from completing my task.
krfricke added the bug and P1 labels and removed the triage label on Sep 12, 2022