Labels: bug, docs, P1, train
What happened + What you expected to happen
I ran the PyTorch documentation example on this page: https://docs.ray.io/en/latest/train/train.html
I uncommented the line that you are supposed to uncomment when using GPUs.
It fails with the following error message:
RayTaskError(RuntimeError): ray::_Inner.train() (pid=839, ip=172.31.175.192, repr=TorchTrainer)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 367, in train
raise skipped from exception_cause(skipped)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 54, in check_for_failure
ray.get(object_ref)
ray.exceptions.RayTaskError(RuntimeError): ray::RayTrainWorker._RayTrainWorker__execute() (pid=935, ip=172.31.175.192, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f07a62d33d0>)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/worker_group.py", line 31, in __execute
raise skipped from exception_cause(skipped)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
train_func(*args, **kwargs)
File "<ipython-input-1-8dcfea9e7e7e>", line 41, in train_loop_per_worker
File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/nn/modules/loss.py", line 530, in forward
return F.mse_loss(input, target, reduction=self.reduction)
File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/nn/functional.py", line 3280, in mse_loss
return torch._C._nn.mse_loss(expanded_input, expanded_target, _Reduction.get_enum(reduction))
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
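For context, this is the RuntimeError PyTorch raises whenever a loss is computed between a CUDA tensor and a CPU tensor. A minimal standalone illustration (assuming a CUDA device is available) that triggers the same message:

import torch
import torch.nn as nn

# The model output lives on cuda:0, but the target tensor is never moved off
# the CPU, so MSELoss fails with "Expected all tensors to be on the same
# device, but found at least two devices, cuda:0 and cpu!"
model = nn.Linear(1, 1).cuda()
inputs = torch.randn(4, 1).cuda()
targets = torch.randn(4, 1)  # still on the CPU
loss = nn.MSELoss()(model(inputs), targets)  # raises the RuntimeError above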
Full output attached: output.txt
Versions / Dependencies
I got this in two different settings:
An autoscaling cluster with an m5.2xlarge head node and two worker node types (m5.4xlarge and g4dn.4xlarge).
Reproduction script
This is the code
import torch
import torch.nn as nn

import ray
from ray import train
from ray.air import session, Checkpoint
from ray.train.torch import TorchTrainer
from ray.air.config import ScalingConfig

input_size = 1
layer_size = 15
output_size = 1
num_epochs = 3


class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.layer1 = nn.Linear(input_size, layer_size)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(layer_size, output_size)

    def forward(self, input):
        return self.layer2(self.relu(self.layer1(input)))


def train_loop_per_worker():
    dataset_shard = session.get_dataset_shard("train")
    model = NeuralNetwork()
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    model = train.torch.prepare_model(model)

    for epoch in range(num_epochs):
        for batches in dataset_shard.iter_torch_batches(
            batch_size=32, dtypes=torch.float
        ):
            inputs, labels = torch.unsqueeze(batches["x"], 1), batches["y"]
            output = model(inputs)
            loss = loss_fn(output, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            print(f"epoch: {epoch}, loss: {loss.item()}")

        session.report(
            {},
            checkpoint=Checkpoint.from_dict(
                dict(epoch=epoch, model=model.state_dict())
            ),
        )


train_dataset = ray.data.from_items([{"x": x, "y": 2 * x + 1} for x in range(200)])

# scaling_config = ScalingConfig(num_workers=3)
# If using GPUs, use the below scaling config instead.
scaling_config = ScalingConfig(num_workers=3, use_gpu=True)  # This is the line I uncommented

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=scaling_config,
    datasets={"train": train_dataset},
)
result = trainer.fit()
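A minimal workaround sketch (hypothetical, not verified on this cluster), assuming the failure comes from the batch tensors staying on the CPU while prepare_model() moves the model to cuda:0: move each batch onto the worker's device, e.g. via train.torch.get_device(), before the forward pass. The function name below is made up; it reuses NeuralNetwork and num_epochs from the script above.

def train_loop_per_worker_workaround():
    dataset_shard = session.get_dataset_shard("train")
    model = train.torch.prepare_model(NeuralNetwork())
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    device = train.torch.get_device()  # cuda:0 on a GPU worker, cpu otherwise

    for epoch in range(num_epochs):
        for batches in dataset_shard.iter_torch_batches(
            batch_size=32, dtypes=torch.float
        ):
            # Move both inputs and labels to the same device as the model.
            inputs = torch.unsqueeze(batches["x"], 1).to(device)
            labels = batches["y"].to(device)
            loss = loss_fn(model(inputs), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()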
Issue Severity
Medium: It is a significant difficulty but I can work around it.
robertnishihara added the bug and triage labels on Jan 15, 2023.
richardliaw added the P1, train, air, and docs labels and removed the triage label on Jan 15, 2023.