[CI][Docker] xgboost 1.0.1 causes segfault on test_autotvm_xgboost_model.py #4953

Closed
leandron opened this issue Feb 27, 2020 · 61 comments

Comments

@leandron
Contributor

With the release of XGBoost 1.0.x (i.e., xgboost-1.0.1-py3-none-manylinux1_x86_64.whl), it seems that installing TVM from scratch (rebuilding the Docker containers) causes tests/python/unittest/test_autotvm_xgboost_model.py to fail with a segfault.

Investigating a bit further: if I manually revert to xgboost-0.90, it works fine. With xgboost-1.0.1, this is the message I see:

tests/python/unittest/test_autotvm_xgboost_model.py::test_fit Fatal Python error: Segmentation fault

Thread 0x00007f4f98de4700 (most recent call first):
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 379 in _recv
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 407 in _recv_bytes
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 250 in recv
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 463 in _handle_results
  File "/usr/lib/python3.6/threading.py", line 864 in run
  File "/usr/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/usr/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007f4f905e3700 (most recent call first):
  File "/usr/lib/python3.6/threading.py", line 295 in wait
  File "/usr/lib/python3.6/queue.py", line 164 in get
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 415 in _handle_tasks
  File "/usr/lib/python3.6/threading.py", line 864 in run
  File "/usr/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/usr/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007f4f8fde2700 (most recent call first):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 406 in _handle_workers
  File "/usr/lib/python3.6/threading.py", line 864 in run
  File "/usr/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/usr/lib/python3.6/threading.py", line 884 in _bootstrap

Current thread 0x00007f4fb514c700 (most recent call first):
  File "/usr/local/lib/python3.6/dist-packages/xgboost/core.py", line 1248 in update
  File "/usr/local/lib/python3.6/dist-packages/xgboost/training.py", line 74 in _train_internal
  File "/usr/local/lib/python3.6/dist-packages/xgboost/training.py", line 209 in train
  File "/workspace/python/tvm/autotvm/tuner/xgboost_cost_model.py", line 272 in fit_log
  File "/workspace/tests/python/unittest/test_autotvm_xgboost_model.py", line 35 in test_fit
  File "/usr/local/lib/python3.6/dist-packages/_pytest/python.py", line 167 in pytest_pyfunc_call
  File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 in _multicall
  File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 in <lambda>
  File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 in _hookexec
  File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in __call__
  File "/usr/local/lib/python3.6/dist-packages/_pytest/python.py", line 1445 in runtest
  File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 134 in pytest_runtest_call
  File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 in _multicall
  File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 in <lambda>
  File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 in _hookexec
  File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in __call__
  File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 210 in <lambda>
  File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 237 in from_call
  File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 210 in call_runtest_hook
  File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 185 in call_and_report
  File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 99 in runtestprotocol
  File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 84 in pytest_runtest_protocol
  File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 in _multicall
  File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 in <lambda>
  File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 in _hookexec
  File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in __call__
  File "/usr/local/lib/python3.6/dist-packages/_pytest/main.py", line 271 in pytest_runtestloop
  File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 in _multicall
  File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 in <lambda>
  File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 in _hookexec
  File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in __call__
  File "/usr/local/lib/python3.6/dist-packages/_pytest/main.py", line 247 in _main
  File "/usr/local/lib/python3.6/dist-packages/_pytest/main.py", line 197 in wrap_session
  File "/usr/local/lib/python3.6/dist-packages/_pytest/main.py", line 240 in pytest_cmdline_main
  File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 in _multicall
  File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 in <lambda>
  File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 in _hookexec
  File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in __call__
  File "/usr/local/lib/python3.6/dist-packages/_pytest/config/__init__.py", line 93 in main
  File "/usr/local/lib/python3.6/dist-packages/pytest/__main__.py", line 7 in <module>
  File "/usr/lib/python3.6/runpy.py", line 85 in _run_code
  File "/usr/lib/python3.6/runpy.py", line 193 in _run_module_as_main
./tests/scripts/task_python_unittest.sh: line 27: 24582 Segmentation fault      (core dumped) TVM_FFI=ctypes python3 -m pytest -v tests/python/unittest

@tqchen, I didn't see any PR or discussion about this. Are you aware of any ongoing initiative to move TVM to XGBoost 1.0.x, or shall we pin xgboost to 0.90 to prevent the error? (Note: I'm happy to send a patch to pin the version.)

@tqchen
Member

tqchen commented Feb 27, 2020

@leandron can you create a minimal reproducible example, e.g. by pickling the data that causes the segfault in XGBoost? Then we can start bringing it to the attention of the XGBoost dev community. In the meantime, we can pin xgboost to 0.90.
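
For anyone attempting the reproduction suggested above, one approach is to snapshot the arguments right before the call that crashes, so they can be reloaded in a standalone script. The following is a minimal sketch using only the standard library. `snapshot_inputs` and `load_inputs` are hypothetical helper names, not part of TVM or XGBoost, and the parameter values are stand-ins. It assumes the objects involved are picklable; objects backed by native handles may need their library's own serialization instead.

```python
import pickle
from pathlib import Path


def snapshot_inputs(path, **named_inputs):
    """Write the given keyword arguments to `path` as a single pickle file."""
    path = Path(path)
    with path.open("wb") as f:
        pickle.dump(named_inputs, f, protocol=pickle.HIGHEST_PROTOCOL)
    return path


def load_inputs(path):
    """Read back the dict written by snapshot_inputs."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    import tempfile

    # Stand-in values; a real run would pass the actual training arguments.
    params = {"max_depth": 3, "eta": 0.3}
    with tempfile.TemporaryDirectory() as tmp:
        saved = snapshot_inputs(Path(tmp) / "inputs.pkl", params=params, rows=[[0.1, 0.2]])
        print(load_inputs(saved))
```

A replay script would then call `load_inputs` and pass the restored values to the failing function in a fresh interpreter.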

@tqchen
Member

tqchen commented Feb 27, 2020

cc @merrymercy @hcho3, who might be interested in this issue.

@hcho3
Contributor

hcho3 commented Feb 27, 2020

It would be nice if we could get a reproducible example. We are currently working on the 1.0.2 patch release, and I want to include a fix for this issue.

tqchen pushed a commit that referenced this issue Mar 3, 2020
* Sets xgboost dependency to be 0.90, preventing
   segfaults during TVM python unit tests execution

 * This is discussed in issue #4953
@tqchen
Member

tqchen commented Mar 11, 2020

@leandron @hcho3 please follow up :)

@tqchen
Member

tqchen commented Mar 30, 2020

@leandron can you please comment on the current state?

@leandron
Contributor Author

leandron commented Apr 3, 2020

Hi, I only managed to investigate this further today. XGBoost is now at version 1.0.2, and I can still reproduce the issue.

To give some context, this is the function call that triggers the issue:
https://github.com/apache/incubator-tvm/blob/54975a3fd24fa45b815be39075f4614e53009444/python/tvm/autotvm/tuner/xgboost_cost_model.py#L262-L272

I tried creating pickle files of some of the inputs (self.xgb_params and dtrain) and simplifying the function call, but this is not enough to reproduce the issue. The issue seems to be in the context of custom_callback, below:

https://github.com/apache/incubator-tvm/blob/54975a3fd24fa45b815be39075f4614e53009444/python/tvm/autotvm/tuner/xgboost_cost_model.py#L419-L526

Something that could help me narrow down where the problem lies would be running XGBoost in debug mode. @hcho3 what is the simplest way I can do that?

@hcho3
Contributor

hcho3 commented Apr 7, 2020

@leandron Thanks for pointing out which part of the TVM test is failing. I'm not sure running in debug mode would help, since XGBoost is crashing with a segfault here. I will take a look some time this week.
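
As a side note for readers: the "Fatal Python error: Segmentation fault" dump at the top of this thread is in the format produced by Python's standard `faulthandler` module. When running a script outside pytest, it can be enabled explicitly so that a native crash still leaves a Python-level stack for every thread. The following is a minimal sketch that writes the dump to a log file. `enable_crash_log` is a hypothetical helper name and the log path is up to the caller.

```python
import faulthandler


def enable_crash_log(path):
    """Enable faulthandler, writing any fatal-signal traceback to `path`.

    Returns the open file object. It must stay open for as long as
    faulthandler is enabled, because the handler writes to its descriptor.
    """
    log_file = open(path, "w")
    # all_threads=True dumps every thread, like the traces in this thread.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


if __name__ == "__main__":
    handle = enable_crash_log("crash.log")
    print("faulthandler enabled:", faulthandler.is_enabled())
    # ... run the code under investigation here ...
    faulthandler.disable()
    handle.close()
```

This only shows where in Python the crash was triggered. The native frames still need a tool such as gdb.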

trevor-m pushed a commit to trevor-m/tvm that referenced this issue Apr 16, 2020
* Sets xgboost dependency to be 0.90, preventing
   segfaults during TVM python unit tests execution

 * This is discussed in issue apache#4953
zhiics pushed a commit to neo-ai/tvm that referenced this issue Apr 17, 2020
* Sets xgboost dependency to be 0.90, preventing
   segfaults during TVM python unit tests execution

 * This is discussed in issue apache#4953
@hcho3
Contributor

hcho3 commented Apr 20, 2020

I compiled TVM from source and tried running the test tests/python/unittest/test_autotvm_xgboost_model.py::test_fit and I cannot reproduce the issue. Do I need a specific Docker container to reproduce the problem?

@hcho3
Contributor

hcho3 commented Apr 20, 2020

I also tried building ci-cpu Docker image from scratch and running the unit test inside the container. The test runs without crashing.

@leandron
Contributor Author

We applied a workaround, pinning the xgboost version to 0.90. Which XGBoost version do you see in the image you created from scratch?

@hcho3
Contributor

hcho3 commented Apr 20, 2020

@leandron I checked out commit 8502691 so that XGBoost 1.0.x is used.

@tqchen
Member

tqchen commented Apr 22, 2020

@leandron can you also provide a few more details?

For example, does running tests/python/unittest/test_autotvm_xgboost_model.py directly fail, or do we need to run the entire unit test suite? It would also be nice if you could share a hash or tag of the CI image (perhaps on Docker Hub) so we can confirm the problem.

I tried to build a Docker image with xgboost==1.0.2 and cannot seem to reproduce the issue.

@areusch
Contributor

areusch commented Apr 22, 2020

I see this reliably with a virtualenv on bionic on AWS.

environment:
ami: 099720109477/ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20200408
using tvm revision: 72f2aea2dd219bf55c15b3cf4cfc21491f1f60dd

command: TVM_FFI=ctypes python3 -m pytest -s -v tests/python/unittest -k 'test_autotvm_xgboost_model'

python version

Python 3.7.5

installed python packages:

antlr4-python3-runtime==4.8
Cython==0.29.16
decorator==4.4.2
psutil==5.7.0
pylint==2.4.4
  - astroid [required: >=2.3.0,<2.4, installed: 2.3.3]
    - lazy-object-proxy [required: ==1.4.*, installed: 1.4.3]
    - six [required: ~=1.12, installed: 1.14.0]
    - typed-ast [required: >=1.4.0,<1.5, installed: 1.4.1]
    - wrapt [required: ==1.11.*, installed: 1.11.2]
  - isort [required: >=4.2.5,<5, installed: 4.3.21]
  - mccabe [required: >=0.6,<0.7, installed: 0.6.1]
pytest==5.4.1
  - attrs [required: >=17.4.0, installed: 19.3.0]
  - importlib-metadata [required: >=0.12, installed: 1.6.0]
    - zipp [required: >=0.5, installed: 3.1.0]
  - more-itertools [required: >=4.0.0, installed: 8.2.0]
  - packaging [required: Any, installed: 20.3]
    - pyparsing [required: >=2.0.2, installed: 2.4.7]
    - six [required: Any, installed: 1.14.0]
  - pluggy [required: >=0.12,<1.0, installed: 0.13.1]
    - importlib-metadata [required: >=0.12, installed: 1.6.0]
      - zipp [required: >=0.5, installed: 3.1.0]
  - py [required: >=1.5.0, installed: 1.8.1]
  - wcwidth [required: Any, installed: 0.1.9]
tornado==6.0.4
xgboost==1.0.2
  - numpy [required: Any, installed: 1.18.3]
  - scipy [required: Any, installed: 1.4.1]
    - numpy [required: >=1.13.3, installed: 1.18.3]

backtrace:

~/ws/tvm$ gdb python
GNU gdb (Ubuntu 8.1-0ubuntu3.2) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...(no debugging symbols found)...done.
(gdb) run tests/python/unittest/test_autotvm_xgboost_model.py
Starting program: /home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/bin/python tests/python/unittest/test_autotvm_xgboost_model.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff4104700 (LWP 11905)]
[New Thread 0x7ffff3903700 (LWP 11906)]
[New Thread 0x7fffef102700 (LWP 11907)]
[New Thread 0x7fffec901700 (LWP 11908)]
[New Thread 0x7fffea100700 (LWP 11909)]
[New Thread 0x7fffe78ff700 (LWP 11910)]
[New Thread 0x7fffe50fe700 (LWP 11911)]
[New Thread 0x7fffe28fd700 (LWP 11912)]
[New Thread 0x7fffe00fc700 (LWP 11913)]
[New Thread 0x7fffdd8fb700 (LWP 11914)]
[New Thread 0x7fffdb0fa700 (LWP 11915)]
[New Thread 0x7fffd88f9700 (LWP 11916)]
[New Thread 0x7fffd60f8700 (LWP 11917)]
[New Thread 0x7fffd38f7700 (LWP 11918)]
[New Thread 0x7fffd10f6700 (LWP 11919)]
[Thread 0x7fffe50fe700 (LWP 11911) exited]
[Thread 0x7fffd88f9700 (LWP 11916) exited]
[Thread 0x7fffd10f6700 (LWP 11919) exited]
[Thread 0x7fffd60f8700 (LWP 11917) exited]
[Thread 0x7fffdb0fa700 (LWP 11915) exited]
[Thread 0x7fffdd8fb700 (LWP 11914) exited]
[Thread 0x7fffe00fc700 (LWP 11913) exited]
[Thread 0x7fffe28fd700 (LWP 11912) exited]
[Thread 0x7fffe78ff700 (LWP 11910) exited]
[Thread 0x7fffea100700 (LWP 11909) exited]
[Thread 0x7fffec901700 (LWP 11908) exited]
[Thread 0x7fffef102700 (LWP 11907) exited]
[Thread 0x7ffff3903700 (LWP 11906) exited]
[Thread 0x7ffff4104700 (LWP 11905) exited]
[Thread 0x7fffd38f7700 (LWP 11918) exited]
[New Thread 0x7fffd10f6700 (LWP 11936)]
[New Thread 0x7fffd38f7700 (LWP 11937)]
[New Thread 0x7fffd60f8700 (LWP 11938)]
[Thread 0x7fffd10f6700 (LWP 11936) exited]
[Thread 0x7fffd60f8700 (LWP 11938) exited]
[Thread 0x7fffd38f7700 (LWP 11937) exited]
[New Thread 0x7fffd38f7700 (LWP 11955)]
[New Thread 0x7fffd60f8700 (LWP 11956)]
[New Thread 0x7fffd10f6700 (LWP 11957)]

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fffb5ba9e37 in std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > > xgboost::XGBoostParameter<xgboost::GenericParameter>::UpdateAllowUnknown<std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > > >(std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > > const&, bool*) ()
   from /home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/xgboost/./lib/libxgboost.so
(gdb) bt
#0  0x00007fffb5ba9e37 in std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > > xgboost::XGBoostParameter<xgboost::GenericParameter>::UpdateAllowUnknown<std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > > >(std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > > const&, bool*) ()
   from /home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/xgboost/./lib/libxgboost.so
#1  0x00007fffb5b970b7 in xgboost::GenericParameter::ConfigureGpuId(bool) ()
   from /home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/xgboost/./lib/libxgboost.so
#2  0x00007fffb5bb2cd1 in xgboost::LearnerImpl::Configure() ()
   from /home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/xgboost/./lib/libxgboost.so
#3  0x00007fffb5bad69d in xgboost::LearnerImpl::UpdateOneIter(int, xgboost::DMatrix*) ()
   from /home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/xgboost/./lib/libxgboost.so
#4  0x00007fffb5ab0639 in XGBoosterUpdateOneIter ()
   from /home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/xgboost/./lib/libxgboost.so
#5  0x00007fffce06adae in ffi_call_unix64 () from /usr/lib/x86_64-linux-gnu/libffi.so.6
#6  0x00007fffce06a71f in ffi_call () from /usr/lib/x86_64-linux-gnu/libffi.so.6
#7  0x00007fffce27e944 in _ctypes_callproc () from /usr/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so
#8  0x00007fffce27efb3 in ?? () from /usr/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so
#9  0x00000000005d96db in _PyObject_FastCallKeywords ()
#10 0x000000000054aa51 in ?? ()
#11 0x0000000000551c08 in _PyEval_EvalFrameDefault ()
#12 0x000000000054b302 in _PyEval_EvalCodeWithName ()
#13 0x00000000005d8bd2 in _PyFunction_FastCallKeywords ()
#14 0x000000000054de9e in _PyEval_EvalFrameDefault ()
#15 0x000000000054b302 in _PyEval_EvalCodeWithName ()
#16 0x00000000005d8bd2 in _PyFunction_FastCallKeywords ()
#17 0x000000000054a880 in ?? ()
#18 0x000000000054ebbd in _PyEval_EvalFrameDefault ()
#19 0x000000000054b302 in _PyEval_EvalCodeWithName ()
#20 0x00000000005d8bd2 in _PyFunction_FastCallKeywords ()
#21 0x000000000054a880 in ?? ()
#22 0x000000000054ebbd in _PyEval_EvalFrameDefault ()
#23 0x000000000054b302 in _PyEval_EvalCodeWithName ()
#24 0x00000000005d8bd2 in _PyFunction_FastCallKeywords ()
#25 0x000000000054a880 in ?? ()
#26 0x000000000054ebbd in _PyEval_EvalFrameDefault ()
#27 0x00000000005d88dc in _PyFunction_FastCallKeywords ()
#28 0x000000000054dd08 in _PyEval_EvalFrameDefault ()
#29 0x000000000054b302 in _PyEval_EvalCodeWithName ()
#30 0x000000000054d803 in PyEval_EvalCode ()
#31 0x00000000006308e2 in ?? ()
#32 0x0000000000630997 in PyRun_FileExFlags ()
#33 0x000000000063160f in PyRun_SimpleFileExFlags ()
#34 0x000000000065450e in ?? ()
#35 0x000000000065486e in _Py_UnixMain ()
#36 0x00007ffff7a05b97 in __libc_start_main (main=0x4b84d0 <main>, argc=2, argv=0x7fffffffdfc8, init=<optimized out>, fini=<optimized out>,
    rtld_fini=<optimized out>, stack_end=0x7fffffffdfb8) at ../csu/libc-start.c:310
#37 0x00000000005df80a in _start ()
(gdb)

pytest log:

~/ws/tvm$ TVM_FFI=ctypes python3 -m pytest -s -v tests/python/unittest -k 'test_autotvm_xgboost_model'
================================================================ test session starts =================================================================
platform linux -- Python 3.7.5, pytest-5.4.1, py-1.8.1, pluggy-0.13.1 -- /home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/bin/python3
cachedir: .pytest_cache
rootdir: /home/ubuntu/ws/tvm
collecting 88 items                                                                                                                                  Testing using contexts: [cpu(0)]
collected 562 items / 560 deselected / 2 selected

tests/python/unittest/test_autotvm_xgboost_model.py::test_fit Fatal Python error: Segmentation fault

Thread 0x00007f6c59dd9700 (most recent call first):
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 379 in _recv
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 407 in _recv_bytes
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 250 in recv
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 470 in _handle_results
  File "/usr/lib/python3.7/threading.py", line 870 in run
  File "/usr/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/usr/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x00007f6c5cddb700 (most recent call first):
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 422 in _handle_tasks
  File "/usr/lib/python3.7/threading.py", line 870 in run
  File "/usr/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/usr/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x00007f6c5c5da700 (most recent call first):
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 413 in _handle_workers
  File "/usr/lib/python3.7/threading.py", line 870 in run
  File "/usr/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/usr/lib/python3.7/threading.py", line 890 in _bootstrap

Current thread 0x00007f6c7f460740 (most recent call first):
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/xgboost/core.py", line 1249 in update
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/xgboost/training.py", line 74 in _train_internal
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/xgboost/training.py", line 209 in train
  File "/home/ubuntu/ws/tvm/python/tvm/autotvm/tuner/xgboost_cost_model.py", line 272 in fit_log
  File "/home/ubuntu/ws/tvm/tests/python/unittest/test_autotvm_xgboost_model.py", line 35 in test_fit
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/_pytest/python.py", line 184 in pytest_pyfunc_call
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/manager.py", line 87 in <lambda>
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/_pytest/python.py", line 1479 in runtest
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/_pytest/runner.py", line 135 in pytest_runtest_call
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/manager.py", line 87 in <lambda>
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/_pytest/runner.py", line 217 in <lambda>
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/_pytest/runner.py", line 244 in from_call
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/_pytest/runner.py", line 217 in call_runtest_hook
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/_pytest/runner.py", line 186 in call_and_report
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/_pytest/runner.py", line 100 in runtestprotocol
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/_pytest/runner.py", line 85 in pytest_runtest_protocol
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/manager.py", line 87 in <lambda>
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/_pytest/main.py", line 272 in pytest_runtestloop
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/manager.py", line 87 in <lambda>
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/_pytest/main.py", line 247 in _main
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/_pytest/main.py", line 191 in wrap_session
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/_pytest/main.py", line 240 in pytest_cmdline_main
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/manager.py", line 87 in <lambda>
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/_pytest/config/__init__.py", line 125 in main
  File "/home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/pytest/__main__.py", line 7 in <module>
  File "/usr/lib/python3.7/runpy.py", line 85 in _run_code
  File "/usr/lib/python3.7/runpy.py", line 193 in _run_module_as_main
Segmentation fault (core dumped)
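
Because a segfault takes down the whole test run, it can be useful to execute a suspect snippet in a separate interpreter and inspect how that process ended. The following is a minimal sketch using only the standard library. `classify_exit` and `run_isolated` are hypothetical helper names. It assumes a POSIX system, where `subprocess` reports death by signal as a negative return code.

```python
import signal
import subprocess
import sys


def classify_exit(returncode):
    """Describe how a child process ended, based on subprocess's return code."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative value is the negated number of the fatal signal.
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status %d" % returncode


def run_isolated(code):
    """Run a Python snippet in a fresh interpreter and classify the outcome."""
    proc = subprocess.run([sys.executable, "-c", code], capture_output=True, text=True)
    return classify_exit(proc.returncode)


if __name__ == "__main__":
    print(run_isolated("print('hello')"))
    # Deliberately send SIGSEGV to the child to imitate a native crash.
    print(run_isolated("import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"))
    print(run_isolated("raise ValueError('ordinary Python error')"))
```

The same distinction, an ordinary Python exception versus death by signal, helps when checking whether an error raised inside a native library surfaces cleanly or crashes the process.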

@hcho3
Contributor

hcho3 commented Apr 22, 2020

Let me try again with TVM_FFI=ctypes environment variable set. What does this do?

@tqchen
Member

tqchen commented Apr 22, 2020

The trace might offer some insights. @hcho3, could it be caused by ConfigureGpuId? Also cc @trivialfis, since it seems to relate to dmlc/xgboost#4961.

0x00007fffb5ba9e37 in std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > > xgboost::XGBoostParameter<xgboost::GenericParameter>::UpdateAllowUnknown<std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > > >(std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > > const&, bool*) ()
   from /home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/xgboost/./lib/libxgboost.so
(gdb) bt
#0  0x00007fffb5ba9e37 in std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > > xgboost::XGBoostParameter<xgboost::GenericParameter>::UpdateAllowUnknown<std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > > >(std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > > const&, bool*) ()
   from /home/ubuntu/.local/share/virtualenvs/tvm-FxJJpK7X/lib/python3.7/site-packages/xgboost/./lib/libxgboost.so
#1  0x00007fffb5b970b7 in xgboost::GenericParameter::ConfigureGpuId(bool) ()

@trivialfis

Is it possible that XGBoost somehow linked the wrong dmlc static library?

@tqchen
Member

tqchen commented Apr 22, 2020

I don't know unless we dig deeper, but if the bug is reproducible, it should not be hard to find the cause.

@areusch
Contributor

areusch commented Apr 22, 2020

Tried to reproduce with xgboost built from source (at HEAD/e4f5b6c8 and v1.0.2), no luck (test_autotvm_xgboost_model passes). If I reinstall the pip package (1.0.2), I can reproduce it again.

I built xgboost with this config:
cmake -GNinja .. -DUSE_CUDA=ON -DUSE_NCCL=ON -DOPEN_MP:BOOL=ON

Any other suggestions for building or installing it the way the PyPI package is built? Might it be related to the package being built on CentOS?

@trivialfis

@areusch Are you still running bionic when installing from pip?

@areusch
Contributor

areusch commented Apr 22, 2020

yes

@leandron
Contributor Author

leandron commented Apr 22, 2020

There is something I wanted to point out, which is an insight after @areusch's comment (thanks for that!).

The VM I'm running this test on does not have a GPU. However, the same test used to pass on this very same machine with xgboost<1. Is that the case for you, @areusch?

@hcho3 do you think this could be caused by a change in behaviour in xgboost>=1.0?

@areusch
Contributor

areusch commented Apr 22, 2020

no GPU on my instance (it is c5.4xlarge)

@trivialfis

I can reproduce it on bionic. Here is what I have found so far:

  • It's only reproducible on bionic; I also tested 19.04 (EOL) and 19.10.
  • It's only reproducible with the tvm test; normal training with XGBoost doesn't have the issue.
  • It's only reproducible with the pip installation. Building from source works fine, which is annoying.
  • Valgrind reports a 1-byte invalid read, which is not present when XGBoost is built from source.
  • Valgrind also reports multiple invalid reads from tvm (built from source without any cmake flags).
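
Since the crash depends on whether XGBoost came from pip or from a source build, it helps to confirm which native library a given environment would pick up. The gdb trace earlier in this thread shows the pip wheel's library under `site-packages/xgboost/./lib/libxgboost.so`. The following is a minimal sketch that lists shared objects under an installed package's directory without importing the package. `find_shared_libs` is a hypothetical helper name, and the `.so` naming check assumes Linux.

```python
import importlib.util
from pathlib import Path


def _is_shared_object(path):
    """True for regular files named like libfoo.so or libfoo.so.1."""
    name = path.name
    return path.is_file() and (name.endswith(".so") or ".so." in name)


def find_shared_libs(package):
    """Return sorted paths of shared objects found under a top-level package."""
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return []  # not installed, or a plain module rather than a package
    libs = []
    for location in spec.submodule_search_locations:
        libs.extend(p for p in Path(location).rglob("*") if _is_shared_object(p))
    return sorted(libs)


if __name__ == "__main__":
    for lib in find_shared_libs("xgboost"):
        print(lib)
```

Comparing the printed paths between the pip environment and the source-build environment shows which binary each test run is exercising.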

@hcho3 It would be of great help if I could obtain a debug build or a RelWithDebInfo build.

@hcho3
Contributor

hcho3 commented Apr 23, 2020

I am still unable to reproduce it on my Bionic machine. @areusch Can you share the content of your config.cmake?

@trivialfis I'll try to build a wheel using CentOS Docker image.

@trivialfis

trivialfis commented Apr 23, 2020

I am still unable to reproduce it on my Bionic machine. @areusch Can you share the content of your config.cmake?

@hcho3 You have to install the binary package from pip to reproduce it. Building from source works fine.

@trivialfis

@hcho3 I tried master branch and the commit before:

I wonder if that was due to an inconsistency between the dmlc-core of tvm and xgb. #5401 updated the logging to the latest; please check again.

@hcho3
Contributor

hcho3 commented Apr 23, 2020

@trivialfis And you ran git submodule update --init --recursive?

@trivialfis

trivialfis commented Apr 23, 2020

@hcho3 Yes. Currently detached at the commit before the PR linked above.

fis@fis-Standard-PC-Q35-ICH9-2009:~/Workspace/XGBoost/incubator-tvm$ git status
HEAD detached at 56941fb9d
Untracked files:
  (use "git add <file>..." to include in what will be committed)

	.gdb_history

nothing added to commit but untracked files present (use "git add" to track)

@hcho3
Contributor

hcho3 commented Apr 23, 2020

@areusch Can you try out the latest TVM master on your end? I'm still having trouble reproducing the original issue.

@hcho3
Contributor

hcho3 commented Apr 23, 2020

Finally, I reproduced it. Yes! Note: I used the latest TVM as of today. The crash still occurred.

@trivialfis

@hcho3 Could you try applying the patches I posted above and use the corresponding cmake flags?

@hcho3
Contributor

hcho3 commented Apr 23, 2020

@trivialfis I applied your patch and changed CMake flags. And now the unit test does not crash any more.

You should try it too. Get the wheel at https://xgboost-wheels.s3-us-west-2.amazonaws.com/xgboost-1.0.2-py3-none-manylinux1_x86_64.whl.

@trivialfis

@hcho3 Yup. It works fine on my machine too

@areusch
Contributor

areusch commented Apr 23, 2020

@hcho3 tested with your new wheel on my aws instance and the test now passes!

@hcho3
Contributor

hcho3 commented Apr 23, 2020

@trivialfis Can you elaborate what your patch does? Does it hide certain symbols?

@trivialfis

It hides all the symbols except for the C APIs. So if anyone is using the C++ headers, it might generate a lot of linker errors.
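
One way to check the effect described here is to inspect a library's dynamic symbol table, for example with `nm -D --defined-only libxgboost.so`, and see which exported names are C++ symbols (Itanium-mangled names start with `_Z`) and which are plain C API names. The following is a minimal sketch that classifies such output. `split_exported_symbols` is a hypothetical helper name, and the sample text is illustrative only: the addresses are made up and it is not taken from a real build.

```python
def split_exported_symbols(nm_output):
    """Split `nm -D --defined-only` style output into (c_names, cxx_names)."""
    c_names, cxx_names = [], []
    for line in nm_output.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        _address, _kind, name = fields
        # Itanium C++ ABI mangled names begin with "_Z".
        if name.startswith("_Z"):
            cxx_names.append(name)
        else:
            c_names.append(name)
    return c_names, cxx_names


# Illustrative input only; addresses are invented for the example.
SAMPLE = """\
0000000000001000 T XGBoosterUpdateOneIter
0000000000002000 T _ZN7xgboost11LearnerImpl9ConfigureEv
"""

if __name__ == "__main__":
    c_names, cxx_names = split_exported_symbols(SAMPLE)
    print("C API:", c_names)
    print("C++  :", cxx_names)
```

If C++ symbols are hidden as described, the second list would be expected to shrink to nothing or nearly nothing, while the `XGB...` entry points remain.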

@hcho3
Contributor

hcho3 commented Apr 23, 2020

@trivialfis I think we can hide all C++ symbols when building Python wheels. I don't think anyone using C++ headers would use the Pip wheel. WDYT?

@trivialfis

@hcho3
Contributor

hcho3 commented Apr 23, 2020

@trivialfis Let me file a pull request. We'll include the fix as part of the upcoming 1.1.0 release.

@trivialfis

trivialfis commented Apr 23, 2020

We need to be careful about this. Rabit and dmlc-core are built independently, and I'm not sure what will happen if they throw an error, as hiding symbols means exceptions cannot be propagated out.

@trivialfis

Not entirely sure in the context of static linking.

@hcho3
Contributor

hcho3 commented Apr 23, 2020

Got it. How about compiling the wheel using the latest Ubuntu (not CentOS) and putting it in an S3 bucket? The TVM CI can pull from this bucket instead of PyPI.

The current build environment for the Pip wheel is quite old: CentOS 6 + devtoolset-4 (GCC 5.x). When we drop CUDA 9.0 support, we can upgrade the build environment to CentOS 6 + devtoolset-6 (GCC 7.x). The upgrade may fix the issue.

@trivialfis

Before upgrading the libc dependency, we can try adding a test that forces rabit to throw an error and see whether it crashes XGBoost with a segfault. An uneven allreduce would do.

Or a test with dmlc-core; a file-not-found error seems simple enough.

@trivialfis

Just make sure the error is not thrown in a header.

@trivialfis

I believe that if it works, it will be a net gain for XGBoost; hiding symbols is good practice for shared libraries.

@hcho3
Contributor

hcho3 commented Apr 24, 2020

With dmlc/xgboost#5590, I can now run tests/python/unittest/test_autotvm_xgboost_model.py::test_fit without crashing.

@hcho3
Contributor

hcho3 commented Apr 24, 2020

@leandron @areusch @tqchen I've put up RC1 for the upcoming XGBoost 1.1.0 release. Feel free to try it:

python3 -m pip install xgboost==1.1.0rc1

The unit test should not crash.
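
Based only on what is reported in this thread (0.90 and 1.1.0rc1 pass, while the 1.0.1 and 1.0.2 pip wheels crash), a test environment could check for the affected series before running. The following is a minimal sketch using the standard library (`importlib.metadata` needs Python 3.8 or newer). `is_affected`, `release_tuple`, and `installed_version` are hypothetical helper names, and treating all of 1.0.x as affected is an assumption drawn from this discussion, not an official compatibility statement.

```python
import re
from importlib import metadata


def release_tuple(version):
    """Return the leading (major, minor) numbers of a version string."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognized version: %r" % version)
    return int(match.group(1)), int(match.group(2))


def is_affected(version):
    """True for 1.0.x, the series reported as crashing in this thread (assumption)."""
    return release_tuple(version) == (1, 0)


def installed_version(package):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("xgboost")
    if version is None:
        print("xgboost is not installed")
    elif is_affected(version):
        print(version, "is in the 1.0.x series discussed in this thread")
    else:
        print(version, "is outside the 1.0.x series")
```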

@tqchen tqchen closed this as completed Apr 27, 2020
@tqchen
Member

tqchen commented Apr 27, 2020

Thanks @hcho3 @areusch @leandron @trivialfis for resolving this problem.
