[train] Horovod+Torch does not convert GPU tensors to CPU #28439

Closed
krfricke opened this issue Sep 12, 2022 · 0 comments · Fixed by #28440
Labels: bug (something that is supposed to be working, but isn't), P1 (issue that should be fixed within a few weeks)


What happened + What you expected to happen

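When using a HorovodTrainer with Torch on GPU, reporting a checkpoint that contains the model's state dict fails with a deserialization error on the main trainer worker, which does not hold any GPU resources. The tensors in the state dict are still CUDA tensors when they reach that worker; I expected them to be converted to CPU tensors first.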

(HorovodTrainer pid=16103)   File "/home/ray/anaconda3/lib/python3.7/site-packages/torch/serialization.py", line 151, in _cuda_deserialize
(HorovodTrainer pid=16103)     device = validate_cuda_device(location)
(HorovodTrainer pid=16103)   File "/home/ray/anaconda3/lib/python3.7/site-packages/torch/serialization.py", line 135, in validate_cuda_device
(HorovodTrainer pid=16103)     raise RuntimeError('Attempting to deserialize object on a CUDA '
(HorovodTrainer pid=16103) RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

Versions / Dependencies

Latest master

Reproduction script

import torch
from torch import nn

from ray.air import Checkpoint, session, ScalingConfig
from ray.train.horovod import HorovodTrainer


def train_loop(config):
    net = nn.Linear(in_features=8, out_features=16)
    net.to("cuda")

    # The state dict still holds CUDA tensors, so the main trainer worker
    # (which has no GPU) fails when it deserializes the checkpoint.
    checkpoint = Checkpoint.from_dict({"model": net.state_dict()})
    session.report({"metric": 1}, checkpoint=checkpoint)


trainer = HorovodTrainer(
    train_loop_per_worker=train_loop,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)
trainer.fit()
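
A possible user-side workaround, sketched here under assumptions, is to copy every tensor in the state dict to the CPU before building the checkpoint, so nothing CUDA-tagged is ever serialized. The helper name `to_cpu` is hypothetical (it is not a Ray or PyTorch API), and this is not necessarily what the fix in #28440 does. It relies only on objects exposing a `.cpu()` method, as `torch.Tensor` does.

```python
from collections.abc import Mapping


def to_cpu(obj):
    """Recursively copy GPU-resident values in a nested structure to the CPU.

    Hypothetical helper, not part of Ray or PyTorch. Anything exposing a
    callable ``.cpu()`` method (as ``torch.Tensor`` does) is treated as a
    tensor; dicts, lists, and tuples are walked; other values pass through.
    """
    cpu_method = getattr(obj, "cpu", None)
    if callable(cpu_method):
        return cpu_method()
    if isinstance(obj, Mapping):
        return {key: to_cpu(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(to_cpu(item) for item in obj)
    return obj


# Intended use inside train_loop from the reproduction script:
#   checkpoint = Checkpoint.from_dict({"model": to_cpu(net.state_dict())})
```

Because `Tensor.cpu()` returns the tensor unchanged when it is already in host memory, the same call should be harmless on CPU-only runs.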

Issue Severity

High: It blocks me from completing my task.

@krfricke added the bug and P1 labels, and added and then removed the triage label, on Sep 12, 2022
@krfricke self-assigned this on Sep 12, 2022