🐛 Bug

I have written some PyTorch Lightning code that I am evaluating with torchmetrics. Locally on my CPU everything works fine. However, when I move my script to a cluster and try to run it on a single GPU, I run into problems. Specifically, it seems like my error metrics are not being calculated properly (see attached image). What could be the cause here?
To Reproduce
I am loading my data with the `LinkNeighborLoader` provided by PyG. I suspect the observed behaviour might be partially attributable to the way `LinkNeighborLoader` sends (or doesn't send?) the data to the correct device. Torchmetrics should handle this out of the box, but it might be that `LinkNeighborLoader` doesn't integrate properly here. I load my data as follows:
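Roughly like this (a trimmed-down sketch; the neighbour counts, batch size and the `train_data` split are placeholders rather than my exact settings):

```python
from torch_geometric.loader import LinkNeighborLoader

# `train_data` is the training split (e.g. produced by RandomLinkSplit);
# the neighbour counts and batch size below are placeholder values
train_loader = LinkNeighborLoader(
    train_data,
    num_neighbors=[10, 10],
    edge_label_index=train_data.edge_label_index,
    edge_label=train_data.edge_label,
    batch_size=128,
    shuffle=True,
)
```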
I use similar loaders for my validation and test data. Then, I initialise my metrics in the LightningModule where I define the Lightning training and testing logic, as follows:
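(Again a minimal sketch; the module name, the placeholder `model` and the choice of error metrics stand in for what I actually use.)

```python
import torch.nn.functional as F
import lightning as L
from torchmetrics import MeanAbsoluteError, MeanSquaredError


class LinkPredictionModule(L.LightningModule):
    def __init__(self, model):
        super().__init__()
        # `model` is a placeholder for my GNN encoder/decoder
        self.model = model
        # one metric instance per stage so their internal states stay separate
        self.train_mse = MeanSquaredError()
        self.train_mae = MeanAbsoluteError()
        self.val_mse = MeanSquaredError()
        self.val_mae = MeanAbsoluteError()
        self.test_mse = MeanSquaredError()
        self.test_mae = MeanAbsoluteError()
```

E.g., in my validation step, I log the metrics as follows (same caveats, simplified):

```python
    # (continuing the same LightningModule as above)
    def validation_step(self, batch, batch_idx):
        # `batch` is the sampled subgraph returned by LinkNeighborLoader
        preds = self.model(batch.x, batch.edge_index, batch.edge_label_index)
        target = batch.edge_label.float()
        loss = F.mse_loss(preds, target)
        # update the metric states and log the metric objects themselves, so that
        # Lightning/torchmetrics handle the epoch-level compute() and reset()
        self.val_mse.update(preds, target)
        self.val_mae.update(preds, target)
        self.log("val_loss", loss, on_step=False, on_epoch=True, batch_size=target.numel())
        self.log("val_mse", self.val_mse, on_step=False, on_epoch=True)
        self.log("val_mae", self.val_mae, on_step=False, on_epoch=True)
        return loss
```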
The logic is similar for my training and testing steps. This leads to the loss curves displayed in the attached figure, where green is the run on the HPC cluster GPU and red is the local run on CPU.
Expected behavior
The metrics should be logged correctly at each epoch.
Environment
Python environment:
- Python==3.10.7
- PyTorch==2.2.0+cu118
- cudnn==8.9.4
- cuda==12.2
- gcc==12.1.0
- lightning==2.1.3
- torchmetrics==1.3.0.post0
- torch_geometric==2.4.0
- pyg-lib==0.4.0
OS:
- CentOS Linux 7 (Core)
Additional context
Loss curves:

[figure: loss curves; green = HPC cluster GPU run, red = local CPU run]
Hi @aaronwtr, thanks for raising this issue.
Have you solved the issue, or does it still persist? I think a bit more information is needed here. The logging behavior should really not change when you move from one device to another (even, as in this case, to another computer system). I wonder whether it is the logging that is going wrong here or the actual training, e.g. if you just print the metric values to the terminal, do you still get constant values?
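For example, something along these lines (just a sketch, reusing the placeholder metric names from your module) would show whether the computed values themselves are constant:

```python
    def on_validation_epoch_end(self):
        # print the raw computed values so they can be compared with what the logger
        # records; compute() only reads the accumulated state and does not reset it
        print(
            f"epoch {self.current_epoch}: "
            f"val_mse={self.val_mse.compute().item():.6f}, "
            f"val_mae={self.val_mae.compute().item():.6f}"
        )
```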