regression in 0.5 with pytorch segfault #2447
Comments
@vlad17 please share the code when you have a chance. |
Please see the inline script and attached install file. It segfaults, as shown, on a machine with a K80 and CUDA 9.1 installed. Pretty sure the rllib/tf monkey patch is unnecessary; I did not slim it down past my personal code dependencies, but I figured you'd rather get replicating code earlier.
|
Works on a Titan Xp; trying to reproduce on a separate env now... is there a difference with just using |
@richardliaw still segfaults even w/ that setup. can u replicate on a p2? |
trying now |
Got the segfault on a p2.xlarge. Simply
Fails with Torch 0.4 (cuda 9.0) and ray 0.5. Seems to be exactly the same as #2413 |
This also fails: @pcmoritz, @robertnishihara

import sys
sys.path.insert(0, "/home/ubuntu/anaconda3/envs/breaking-env/lib/python3.5/site-packages/ray/pyarrow_files/")
import pyarrow
import torch
print(pyarrow.__file__)  # /home/ubuntu/anaconda3/envs/breaking-env/lib/python3.5/site-packages/ray/pyarrow_files/pyarrow/__init__.py
torch.nn.Conv2d(64, 2, kernel_size=3, stride=1, padding=1, bias=False).cuda()

No error is thrown if one switches the order of the pyarrow and torch imports. The |
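Based on that import-order observation, here is a minimal sketch of a possible user-side workaround: import and initialize torch (and CUDA) before ray, and therefore before its bundled pyarrow. This is only an illustration; whether it holds up in a full ray/tune setup is untested here.

# Workaround sketch only: initialize CUDA via torch before ray imports its
# bundled pyarrow, since the reverse order is what segfaults above.
import torch
torch.nn.Conv2d(64, 2, kernel_size=3, stride=1, padding=1, bias=False).cuda()

import ray  # pulls in the bundled pyarrow only after CUDA is initialized
ray.init()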
@richardliaw Which AMI is this using? On the deep learning AMI, this works for me (with the latest master and pytorch from the AMI). |
DL AMI with a new environment and pytorch from pip
|
Ok, if I use the python3 environment in the DL AMI, install pytorch from pip and ray from source, I still can't reproduce it unfortunately. What else could be different? |
Maybe try ray from pip?
|
Already tried that, no segfault. |
I used this autoscaler setup:
|
So even with

conda create -y -n breaking-env python=3.5
source activate breaking-env && conda install pytorch torchvision cuda91 -c pytorch && pip install ray==0.5 absl-py

and then in IPython:

In [1]: import ray
/home/ubuntu/anaconda3/envs/python3/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/home/ubuntu/anaconda3/envs/python3/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)

In [2]: import torch

In [3]: torch.nn.Conv2d(64, 2, kernel_size=3, stride=1, padding=1, bias=False).cuda()
Out[3]: Conv2d(64, 2, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)

I'm not able to reproduce it. Could it be that the environment of the autoscaler is different in some way (maybe env variables)? |
Looks like you’re not using the same env though...
|
Good point, IPython doesn't seem to be present in the env:
|
With just
|
Here is the backtrace:
So I'm pretty sure it's the same problem that happened with TensorFlow, for which we deployed a workaround in apache/arrow#2210. I'll open a JIRA in arrow. This is super annoying; I hope we can fix the arrow thread pool altogether, otherwise we will need a similar workaround for pytorch too. |
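For anyone trying to capture where such a crash happens, a minimal sketch using the standard library's faulthandler module follows (an assumption on my part, not what was used here); it prints the Python-level stack when the process receives SIGSEGV, which is not the native backtrace referenced above but at least confirms which call crashes.

# Sketch only: faulthandler dumps the Python stack on fatal signals like SIGSEGV,
# which helps confirm the crash happens inside the .cuda() call.
import faulthandler
faulthandler.enable()

import pyarrow  # or: import ray, which loads its bundled pyarrow
import torch

torch.nn.Conv2d(64, 2, kernel_size=3, stride=1, padding=1, bias=False).cuda()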
This is tough, I can only reproduce it with ray pip installed, not compiled from source. And not with pyarrow from pip (maybe that's too old). |
The pyarrow bundled in the Ray pip package works for reproducing this bug, right?
|
Yes it does, but only if ray is pip installed (not if locally compiled). Fortunately now I have also been able to reproduce it with manylinux1 pyarrow wheels compiled from the latest arrow master inside of manylinux1 docker :) |
Here is the arrow bug report: https://issues.apache.org/jira/browse/ARROW-2920 |
This fixes ARROW-2920 (see also ray-project/ray#2447) for me. Unfortunately we might not be able to have regression tests for this right now because we don't have CUDA in our test toolchain.

Author: Philipp Moritz <[email protected]>

Closes #2329 from pcmoritz/fix-pytorch-segfault and squashes the following commits:

1d82825 <Philipp Moritz> fix
74bc93e <Philipp Moritz> add note
ff14c4d <Philipp Moritz> fix
b343ca6 <Philipp Moritz> add regression test
5f0cafa <Philipp Moritz> fix
2751679 <Philipp Moritz> fix
10c5a5c <Philipp Moritz> workaround for pyarrow segfault
System information
Describe the problem
I hit an issue moving a PyTorch 0.4 model onto a k80 GPU from a tune worker where I was unable to see any error trace: the worker was segfaulting.
I was able to replicate the segfault by invoking the same training function (which is my application code) in the same main file in which I started ray with ray.init. As soon as I called model.cuda(), and in particular when a Conv2d module was being moved to the GPU, there was a segfault in the pytorch code at lazy_cuda_init. The only interaction with ray is that ray was initialized in the same process. When I downgrade ray to version 0.4 the issue disappears. This was on an AWS p2 instance.
I'll make a minimal example when I have some time; I just wanted to post the issue after noticing the ray downgrade resolved the problem.
Source code / logs
to come
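In the meantime, here is a minimal sketch reconstructed from the discussion above; this is not the promised example, and it assumes ray 0.5 installed from pip and PyTorch 0.4 with a CUDA GPU available.

# Reconstruction of the failing path described in this thread: ray (and its
# bundled pyarrow) is imported and initialized before a module is moved to GPU.
import ray
import torch

ray.init()
torch.nn.Conv2d(64, 2, kernel_size=3, stride=1, padding=1, bias=False).cuda()  # segfaults here on the affected setup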
@pcmoritz @ericl @richardliaw @robertnishihara