
Segmentation fault when copying PyTorch tensor to cuda just by importing Ray #2413

Closed
floringogianu opened this issue Jul 17, 2018 · 5 comments


@floringogianu

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04 LTS x86_64
  • Ray installed from (source or binary): binary (pip install ray)
  • Ray version: 0.5.0
  • Python version: Python 3.6.0 :: Continuum Analytics, Inc.
  • PyTorch version: 0.4.0
  • CUDA version: 9.1
  • Exact command to reproduce: python main.py, where main.py is:
import ray
import torch


def main():
    # This line crashes with a segmentation fault when ray is imported first.
    x = torch.rand(10).cuda()


if __name__ == '__main__':
    main()

Describe the problem

Running the code above results in:

[1]    24554 segmentation fault (core dumped)  python main.py

I had a good initial experience toying with Ray and PyTorch and was running some benchmarks when I decided to check the CUDA support. Is Ray compatible with PyTorch CUDA tensors?

Source code / logs

/tmp/ray and /tmp/raylogs are empty.

@floringogianu (Author)

Changing the import order of torch and ray stops the seg-faulting (see the sketch below). However, I ran into another issue: torch.cuda() operations seem to be unsupported at this point. Should I open a separate issue for this?
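For reference, a minimal sketch of the reordered script; the only change from main.py above is that torch is imported first:

import torch  # importing torch before ray avoids the segfault
import ray


def main():
    x = torch.rand(10).cuda()


if __name__ == '__main__':
    main()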

@robertnishihara (Collaborator)

Interesting, thanks for reporting this! It looks closely related to #2159.

Out of curiosity, do you have tensorflow installed or not?

If you do import tensorflow before import ray, does it still segfault?
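Concretely, the suggested experiment is the same script with tensorflow imported first (a sketch; the rest of main.py is unchanged):

import tensorflow as tf  # imported before ray, per the question above
import ray
import torch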

@robertnishihara (Collaborator)

@floringogianu can you clarify what you mean by "torch.cuda() operations seem unsupported"?

Could you perhaps try registering a custom serializer/deserializer as in #1856 (comment) (though this may need to be updated for more recent versions of pytorch)? Actually, I think we already do something like this in https://github.com/apache/arrow/blob/4ba8769b4858dcd46a7ea7e40bd6c10102327a0d/python/pyarrow/serialization.py#L131-L153, but maybe we aren't registering serializers for the cuda equivalents.
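A rough sketch of what such a registration might look like, assuming Ray 0.5's ray.register_custom_serializer accepts serializer and deserializer callables and round-tripping the CUDA tensor through a host-side numpy array (the function names here are illustrative, not a tested recipe):

import ray
import torch

ray.init()


def serialize_cuda_tensor(t):
    # Copy the tensor to host memory and hand Ray a plain numpy array.
    return t.cpu().numpy()


def deserialize_cuda_tensor(arr):
    # Rebuild the tensor on the GPU from the numpy array.
    return torch.from_numpy(arr).cuda()


# Assumption: register_custom_serializer takes serializer/deserializer
# callables in Ray 0.5; the exact keyword arguments may differ.
ray.register_custom_serializer(
    torch.cuda.FloatTensor,
    serializer=serialize_cuda_tensor,
    deserializer=deserialize_cuda_tensor)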

@floringogianu (Author)

@robertnishihara thanks for the quick reply, much appreciated. It took me a while to reply because Ray really got me hooked today; I dabbled with it the entire day, and I like it a lot! :)

Back to the issue: I created a separate conda virtualenv and installed tensorflow, and the problem didn't reproduce. I then installed pytorch in this new env, and again the segfault didn't reproduce. I returned to the virtualenv I used yesterday and, again, no segfaults. I have no explanation for what happened. Yesterday I crashed ray a lot, and maybe some processes got stuck in memory, causing the segfaults and the weird crashes and behavior I was experiencing when trying to use torch.cuda objects and operations.

In the examples I was toying with yesterday I didn't need to register new serializers, because I took care to simply pass or return numpy objects (see the sketch below).
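For illustration, that workaround looks roughly like this (a hypothetical remote task, not code from the original examples; assumes a GPU is available locally): the GPU work happens inside the task, and only numpy arrays cross the task boundary.

import ray
import torch

ray.init()


@ray.remote(num_gpus=1)
def scale_on_gpu(arr, factor):
    # Do the arithmetic on the GPU, but return a plain numpy array
    # so Ray never has to serialize a CUDA tensor.
    t = torch.from_numpy(arr).cuda()
    return (t * factor).cpu().numpy()


result = ray.get(scale_on_gpu.remote(torch.rand(10).numpy(), 2.0))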

I will close this issue for now and reopen it only if I get the segfaults again.

@robertnishihara (Collaborator)

Ok sounds good. Definitely reopen this if the issue occurs again.
