
[Train] Ray Train PyTorch documentation example does not work out of the box with GPUs #31684

Closed · Fixed by #31692
robertnishihara opened this issue Jan 15, 2023 · 0 comments

Labels: bug (Something that is supposed to be working; but isn't), docs (An issue or change related to documentation), P1 (Issue that should be fixed within a few weeks), train (Ray Train Related Issue)

Comments

@robertnishihara (Collaborator)

What happened + What you expected to happen

I ran the PyTorch documentation example from this page: https://docs.ray.io/en/latest/train/train.html

I uncommented the line that the example tells you to uncomment when using GPUs.

It fails with the following error message:

RayTaskError(RuntimeError): ray::_Inner.train() (pid=839, ip=172.31.175.192, repr=TorchTrainer)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 367, in train
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 54, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(RuntimeError): ray::RayTrainWorker._RayTrainWorker__execute() (pid=935, ip=172.31.175.192, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f07a62d33d0>)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/worker_group.py", line 31, in __execute
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "<ipython-input-1-8dcfea9e7e7e>", line 41, in train_loop_per_worker
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/nn/modules/loss.py", line 530, in forward
    return F.mse_loss(input, target, reduction=self.reduction)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/nn/functional.py", line 3280, in mse_loss
    return torch._C._nn.mse_loss(expanded_input, expanded_target, _Reduction.get_enum(reduction))
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Full output attached: output.txt
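
The error itself is a plain PyTorch device mismatch: the model output ends up on cuda:0 while the label batch apparently stays on the CPU, so mse_loss gets tensors on two devices. A minimal illustration of just that failure, outside Ray entirely (my own snippet, not from the docs):

import torch
import torch.nn as nn

loss_fn = nn.MSELoss()
output = torch.zeros(4, 1, device="cuda")  # e.g. model output living on the GPU (cuda:0)
labels = torch.zeros(4, 1)                 # labels left on the CPU
loss_fn(output, labels)                    # RuntimeError: Expected all tensors to be on the same device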

Versions / Dependencies

  • Python 3.9
  • Ray 2.2.0

I hit this in two different settings:

  1. A single GPU machine running Ray locally
  2. An autoscaling cluster with a m5.2xlarge head node and two worker node types (m5.4xlarge and g4dn.4xlarge).

Reproduction script

This is the code:

import torch
import torch.nn as nn

import ray
from ray import train
from ray.air import session, Checkpoint
from ray.train.torch import TorchTrainer
from ray.air.config import ScalingConfig

input_size = 1
layer_size = 15
output_size = 1
num_epochs = 3


class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.layer1 = nn.Linear(input_size, layer_size)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(layer_size, output_size)

    def forward(self, input):
        return self.layer2(self.relu(self.layer1(input)))


def train_loop_per_worker():
    dataset_shard = session.get_dataset_shard("train")
    model = NeuralNetwork()
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    model = train.torch.prepare_model(model)

    for epoch in range(num_epochs):
        for batches in dataset_shard.iter_torch_batches(
            batch_size=32, dtypes=torch.float
        ):
            inputs, labels = torch.unsqueeze(batches["x"], 1), batches["y"]
            output = model(inputs)
            loss = loss_fn(output, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            print(f"epoch: {epoch}, loss: {loss.item()}")

        session.report(
            {},
            checkpoint=Checkpoint.from_dict(
                dict(epoch=epoch, model=model.state_dict())
            ),
        )


train_dataset = ray.data.from_items([{"x": x, "y": 2 * x + 1} for x in range(200)])
# scaling_config = ScalingConfig(num_workers=3)
# If using GPUs, use the below scaling config instead.
scaling_config = ScalingConfig(num_workers=3, use_gpu=True)  # This is the line I uncommented
trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=scaling_config,
    datasets={"train": train_dataset},
)
result = trainer.fit()
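
For what it's worth, here is a workaround sketch of my own (not necessarily what the docs fix in #31692 will look like): it assumes the labels simply never make it onto the GPU, and explicitly moves both inputs and labels to the worker's device, which I'm assuming can be obtained via train.torch.get_device(). It is a drop-in replacement for train_loop_per_worker above (same imports), with the checkpoint reporting omitted for brevity.

def train_loop_per_worker():
    dataset_shard = session.get_dataset_shard("train")
    model = NeuralNetwork()
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    model = train.torch.prepare_model(model)
    device = train.torch.get_device()  # device Ray Train assigned to this worker (assumption)

    for epoch in range(num_epochs):
        for batches in dataset_shard.iter_torch_batches(
            batch_size=32, dtypes=torch.float
        ):
            # Move both inputs and labels onto the same device before the forward pass.
            inputs = torch.unsqueeze(batches["x"], 1).to(device)
            labels = batches["y"].to(device)
            output = model(inputs)
            loss = loss_fn(output, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            print(f"epoch: {epoch}, loss: {loss.item()}")
        # (session.report(...) with a checkpoint, as in the original, omitted for brevity)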

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@robertnishihara robertnishihara added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 15, 2023
@richardliaw richardliaw added P1 Issue that should be fixed within a few weeks train Ray Train Related Issue air docs An issue or change related to documentation and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 15, 2023
@matthewdeng matthewdeng self-assigned this Jan 16, 2023