
[Train] Ray Train PyTorch documentation example does not work out of the box with GPUs #31684

Closed · Fixed by #31692
robertnishihara opened this issue Jan 15, 2023 · 0 comments

Labels: bug (Something that is supposed to be working; but isn't), docs (An issue or change related to documentation), P1 (Issue that should be fixed within a few weeks), train (Ray Train Related Issue)

Comments

@robertnishihara (Collaborator)

What happened + What you expected to happen

I ran the PyTorch documentation example from this page: https://docs.ray.io/en/latest/train/train.html

I uncommented the line that the example tells you to uncomment when using GPUs.

It fails with the following error message:

RayTaskError(RuntimeError): ray::_Inner.train() (pid=839, ip=172.31.175.192, repr=TorchTrainer)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 367, in train
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 54, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(RuntimeError): ray::RayTrainWorker._RayTrainWorker__execute() (pid=935, ip=172.31.175.192, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f07a62d33d0>)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/worker_group.py", line 31, in __execute
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "<ipython-input-1-8dcfea9e7e7e>", line 41, in train_loop_per_worker
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/nn/modules/loss.py", line 530, in forward
    return F.mse_loss(input, target, reduction=self.reduction)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/nn/functional.py", line 3280, in mse_loss
    return torch._C._nn.mse_loss(expanded_input, expanded_target, _Reduction.get_enum(reduction))
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Full output attached: output.txt
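
The error itself is a plain PyTorch device mismatch: the model output ends up on cuda:0 while the label batch apparently stays on the CPU, so mse_loss gets tensors on two devices. A minimal illustration of just that failure, outside Ray entirely (my own snippet, not from the docs):

import torch
import torch.nn as nn

loss_fn = nn.MSELoss()
output = torch.zeros(4, 1, device="cuda")  # e.g. model output living on the GPU (cuda:0)
labels = torch.zeros(4, 1)                 # labels left on the CPU
loss_fn(output, labels)                    # RuntimeError: Expected all tensors to be on the same device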

Versions / Dependencies

  • Python 3.9
  • Ray 2.2.0

I hit this in two different settings:

  1. A single GPU machine running Ray locally
  2. An autoscaling cluster with a m5.2xlarge head node and two worker node types (m5.4xlarge and g4dn.4xlarge).

Reproduction script

This is the code:

import torch
import torch.nn as nn

import ray
from ray import train
from ray.air import session, Checkpoint
from ray.train.torch import TorchTrainer
from ray.air.config import ScalingConfig

input_size = 1
layer_size = 15
output_size = 1
num_epochs = 3


class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.layer1 = nn.Linear(input_size, layer_size)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(layer_size, output_size)

    def forward(self, input):
        return self.layer2(self.relu(self.layer1(input)))


def train_loop_per_worker():
    dataset_shard = session.get_dataset_shard("train")
    model = NeuralNetwork()
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    model = train.torch.prepare_model(model)

    for epoch in range(num_epochs):
        for batches in dataset_shard.iter_torch_batches(
            batch_size=32, dtypes=torch.float
        ):
            inputs, labels = torch.unsqueeze(batches["x"], 1), batches["y"]
            output = model(inputs)
            loss = loss_fn(output, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            print(f"epoch: {epoch}, loss: {loss.item()}")

        session.report(
            {},
            checkpoint=Checkpoint.from_dict(
                dict(epoch=epoch, model=model.state_dict())
            ),
        )


train_dataset = ray.data.from_items([{"x": x, "y": 2 * x + 1} for x in range(200)])
# scaling_config = ScalingConfig(num_workers=3)
# If using GPUs, use the below scaling config instead.
scaling_config = ScalingConfig(num_workers=3, use_gpu=True)  # This is the line I uncommented
trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=scaling_config,
    datasets={"train": train_dataset},
)
result = trainer.fit()
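
For what it's worth, here is a workaround sketch of my own (not necessarily what the docs fix in #31692 will look like): it assumes the labels simply never make it onto the GPU, and explicitly moves both inputs and labels to the worker's device, which I'm assuming can be obtained via train.torch.get_device(). It is a drop-in replacement for train_loop_per_worker above (same imports), with the checkpoint reporting omitted for brevity.

def train_loop_per_worker():
    dataset_shard = session.get_dataset_shard("train")
    model = NeuralNetwork()
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    model = train.torch.prepare_model(model)
    device = train.torch.get_device()  # device Ray Train assigned to this worker (assumption)

    for epoch in range(num_epochs):
        for batches in dataset_shard.iter_torch_batches(
            batch_size=32, dtypes=torch.float
        ):
            # Move both inputs and labels onto the same device before the forward pass.
            inputs = torch.unsqueeze(batches["x"], 1).to(device)
            labels = batches["y"].to(device)
            output = model(inputs)
            loss = loss_fn(output, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            print(f"epoch: {epoch}, loss: {loss.item()}")
        # (session.report(...) with a checkpoint, as in the original, omitted for brevity)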

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@robertnishihara robertnishihara added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 15, 2023
@richardliaw richardliaw added P1 Issue that should be fixed within a few weeks train Ray Train Related Issue air docs An issue or change related to documentation and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 15, 2023
@matthewdeng matthewdeng self-assigned this Jan 16, 2023