regression in 0.5 with pytorch segfault #2447
Comments
@vlad17 please share the code when you have a chance. |
Please see the inline script and attached install file. It segfaults, as shown, on a machine with a K80 and CUDA 9.1 installed. Pretty sure the rllib/tf monkey patch is unnecessary; I did not slim it down past my personal code dependencies, but I figured you'd rather get replicating code earlier.
|
Works on a Titan Xp; trying to reproduce on a separate env now... is there a difference with just using |
@richardliaw still segfaults even w/ that setup. can u replicate on a p2? |
trying now |
Got the segfault on a p2.xlarge. Simply
Fails with Torch 0.4 (cuda 9.0) and ray 0.5. Seems to be exactly the same as #2413 |
This also fails: @pcmoritz, @robertnishihara

import sys
sys.path.insert(0, "/home/ubuntu/anaconda3/envs/breaking-env/lib/python3.5/site-packages/ray/pyarrow_files/")
import pyarrow
import torch
print(pyarrow.__file__)  # /home/ubuntu/anaconda3/envs/breaking-env/lib/python3.5/site-packages/ray/pyarrow_files/pyarrow/__init__.py
torch.nn.Conv2d(64, 2, kernel_size=3, stride=1, padding=1, bias=False).cuda()

No error is thrown if one switches the order of the pyarrow and torch imports. The |
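Based on that import-order observation, here is a minimal sketch of a possible user-side workaround: import and initialize torch (and CUDA) before ray, and therefore before its bundled pyarrow. This is only an illustration; whether it holds up in a full ray/tune setup is untested here.

# Workaround sketch only: initialize CUDA via torch before ray imports its
# bundled pyarrow, since the reverse order is what segfaults above.
import torch
torch.nn.Conv2d(64, 2, kernel_size=3, stride=1, padding=1, bias=False).cuda()

import ray  # pulls in the bundled pyarrow only after CUDA is initialized
ray.init()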
@richardliaw Which AMI is this using? On the deep learning AMI, this works for me (with the latest master and pytorch from the AMI). |
DL AMI with a new environment and pytorch from pip
|
Ok, if I use the python3 environment in the DL AMI, install pytorch from pip and ray from source, I still can't reproduce it unfortunately. What else could be different? |
Maybe try ray from pip?
|
Already tried that, no segfault. |
I used this autoscaler setup:
|
So even with

conda create -y -n breaking-env python=3.5
source activate breaking-env && conda install pytorch torchvision cuda91 -c pytorch && pip install ray==0.5 absl-py

and then in IPython:

In [1]: import ray
/home/ubuntu/anaconda3/envs/python3/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/home/ubuntu/anaconda3/envs/python3/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)

In [2]: import torch

In [3]: torch.nn.Conv2d(64, 2, kernel_size=3, stride=1, padding=1, bias=False).cuda()
Out[3]: Conv2d(64, 2, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)

I'm not able to reproduce it. Could it be that the environment of the autoscaler is different in some way (maybe env variables)? |
Looks like you’re not using the same env though...
|
Good point, IPython doesn't seem to be present in the env:
|
With just
|
Here is the backtrace:
So I'm pretty sure it's the same problem that happened with TensorFlow, for which we deployed a workaround in apache/arrow#2210. I'll open a JIRA in arrow. This is super annoying; I hope we can fix the arrow thread pool altogether, otherwise we will need a similar workaround for pytorch too. |
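For anyone trying to capture where such a crash happens, a minimal sketch using the standard library's faulthandler module follows (an assumption on my part, not what was used here); it prints the Python-level stack when the process receives SIGSEGV, which is not the native backtrace referenced above but at least confirms which call crashes.

# Sketch only: faulthandler dumps the Python stack on fatal signals like SIGSEGV,
# which helps confirm the crash happens inside the .cuda() call.
import faulthandler
faulthandler.enable()

import pyarrow  # or: import ray, which loads its bundled pyarrow
import torch

torch.nn.Conv2d(64, 2, kernel_size=3, stride=1, padding=1, bias=False).cuda()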
This is tough, I can only reproduce it with ray pip installed, not compiled from source. And not with pyarrow from pip (maybe that's too old). |
The pyarrow bundled in the Ray pip package works for reproducing this bug, right?
|
Yes it does, but only if ray is pip installed (not if locally compiled). Fortunately now I have also been able to reproduce it with manylinux1 pyarrow wheels compiled from the latest arrow master inside of manylinux1 docker :) |
Here is the arrow bug report: https://issues.apache.org/jira/browse/ARROW-2920 |
This fixes ARROW-2920 (see also ray-project/ray#2447) for me. Unfortunately we might not be able to have regression tests for this right now because we don't have CUDA in our test toolchain.

Author: Philipp Moritz <[email protected]>

Closes #2329 from pcmoritz/fix-pytorch-segfault and squashes the following commits:

1d82825 <Philipp Moritz> fix
74bc93e <Philipp Moritz> add note
ff14c4d <Philipp Moritz> fix
b343ca6 <Philipp Moritz> add regression test
5f0cafa <Philipp Moritz> fix
2751679 <Philipp Moritz> fix
10c5a5c <Philipp Moritz> workaround for pyarrow segfault
System information
Describe the problem
I hit an issue moving a PyTorch 0.4 model onto a k80 GPU from a tune worker where I was unable to see any error trace: the worker was segfaulting.
I was able to replicate the segfault by invoking the same training function (which is my application code) in the same main file in which I started ray with ray.init. As soon as I called model.cuda(), and in particular when a Conv2d module was being moved to the GPU, there was a segfault in the pytorch code at lazy_cuda_init. The only interaction with ray is that ray was initialized in the same process. When I downgrade ray to version 0.4 the issue disappears. This was on an AWS p2 instance.
I'll make a minimal example when I have some time; I just wanted to post the issue after noticing the ray downgrade resolved the problem.
Source code / logs
to come
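In the meantime, here is a minimal sketch reconstructed from the discussion above; this is not the promised example, and it assumes ray 0.5 installed from pip and PyTorch 0.4 with a CUDA GPU available.

# Reconstruction of the failing path described in this thread: ray (and its
# bundled pyarrow) is imported and initialized before a module is moved to GPU.
import ray
import torch

ray.init()
torch.nn.Conv2d(64, 2, kernel_size=3, stride=1, padding=1, bias=False).cuda()  # segfaults here on the affected setup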
@pcmoritz @ericl @richardliaw @robertnishihara