Segmentation fault in sampling multi-processing code #308

Closed
jinhongyii opened this issue Sep 22, 2023 · 8 comments

Comments

@jinhongyii

I encounter a segfault when profiling a program with 2 processes, each controlling 1 GPU. Here's the backtrace:


HSA_TOOLS_LIB=/opt/omnitrace/lib/libomnitrace-dl.so.1.10.3
HSA_TOOLS_REPORT_LOAD_FAILURE=1
LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.10.3
OMNITRACE_CRITICAL_TRACE=false
OMNITRACE_USE_PROCESS_SAMPLING=false
OMNITRACE_USE_SAMPLING=true
OMP_TOOL_LIBRARIES=/opt/omnitrace/lib/libomnitrace-dl.so.1.10.3
ROCP_HSA_INTERCEPT=1
ROCP_TOOL_LIB=/opt/omnitrace/lib/libomnitrace.so.1.10.3

[omnitrace][dl][11012] omnitrace_main
[omnitrace][11012][omnitrace_init_tooling] Instrumentation mode: Sampling

      ______   .___  ___. .__   __.  __  .___________..______          ___       ______  _______
     /  __  \  |   \/   | |  \ |  | |  | |           ||   _  \        /   \     /      ||   ____|
    |  |  |  | |  \  /  | |   \|  | |  | `---|  |----`|  |_)  |      /  ^  \   |  ,----'|  |__
    |  |  |  | |  |\/|  | |  . `  | |  |     |  |     |      /      /  /_\  \  |  |     |   __|
    |  `--'  | |  |  |  | |  |\   | |  |     |  |     |  |\  \----./  _____  \ |  `----.|  |____
     \______/  |__|  |__| |__| \__| |__|     |__|     | _| `._____/__/     \__\ \______||_______|

    omnitrace v1.10.3 (x86_64-linux-gnu, compiler: GNU v11.4.0, rocm: v5.7.x)
[omnitrace][11012] /proc/sys/kernel/perf_event_paranoid has a value of 4. Disabling PAPI (requires a value <= 2)...
[omnitrace][11012] In order to enable PAPI support, run 'echo N | sudo tee /proc/sys/kernel/perf_event_paranoid' where N is <= 2
[280.480]       perfetto.cc:58649 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""
[omnitrace][11023][omnitrace_init_tooling] Instrumentation mode: Sampling

      ______   .___  ___. .__   __.  __  .___________..______          ___       ______  _______
     /  __  \  |   \/   | |  \ |  | |  | |           ||   _  \        /   \     /      ||   ____|
    |  |  |  | |  \  /  | |   \|  | |  | `---|  |----`|  |_)  |      /  ^  \   |  ,----'|  |__
    |  |  |  | |  |\/|  | |  . `  | |  |     |  |     |      /      /  /_\  \  |  |     |   __|
    |  `--'  | |  |  |  | |  |\   | |  |     |  |     |  |\  \----./  _____  \ |  `----.|  |____
     \______/  |__|  |__| |__| \__| |__|     |__|     | _| `._____/__/     \__\ \______||_______|

    omnitrace v1.10.3 (x86_64-linux-gnu, compiler: GNU v11.4.0, rocm: v5.7.x)
[omnitrace][11023] /proc/sys/kernel/perf_event_paranoid has a value of 4. Disabling PAPI (requires a value <= 2)...
[omnitrace][11023] In order to enable PAPI support, run 'echo N | sudo tee /proc/sys/kernel/perf_event_paranoid' where N is <= 2
[281.460]       perfetto.cc:58649 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""

[omnitrace][11012][505] Signal 11 caught : Segmentation fault (Address not mapped to object [0x8])

### ERROR ### [omnitrace][PID=11012][TID=505] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x8
Backtrace:
[PID=11012][TID=505][0/9] __sigaction +0x50
[PID=11012][TID=505][1/9] OnUnload +0x3ea18
[PID=11012][TID=505][2/9] roctracer_open_pool +0xdbe
[PID=11012][TID=505][3/9] roctracer_open_pool +0x15f3
[PID=11012][TID=505][4/9] _ZNKSt10error_code23default_error_conditionEv +0x33
[PID=11012][TID=505][5/9] _ZNSt8_Rb_treeINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES5_St9_IdentityIS5_ESt4lessIS5_ESaIS5_EE17_M_emplace_uniqueIJRA14_KcEEESt4pairISt17_Rb_tree_iteratorIS5_EbEDpOT_ +0x3bd0f2
[PID=11012][TID=505][6/9] _ZNSt8_Rb_treeINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES5_St9_IdentityIS5_ESt4lessIS5_ESaIS5_EE17_M_emplace_uniqueIJRA14_KcEEESt4pairISt17_Rb_tree_iteratorIS5_EbEDpOT_ +0x3beaf1
[PID=11012][TID=505][7/9] pthread_condattr_setpshared +0x513
[PID=11012][TID=505][8/9] __xmknodat +0x230

Backtrace (demangled):
[PID=11012][TID=505][0/9] /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7fde95a42520]
[PID=11012][TID=505][1/9] /opt/omnitrace/lib/libomnitrace.so(+0x854338) [0x7fde93854338]
[PID=11012][TID=505][2/9] /opt/rocm-5.7.0/lib/libroctracer64.so.4(+0x23c2e) [0x7fde95d81c2e]
[PID=11012][TID=505][3/9] /opt/rocm-5.7.0/lib/libroctracer64.so.4(+0x24463) [0x7fde95d82463]
[PID=11012][TID=505][4/9] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fde956dc253]
[PID=11012][TID=505][5/9] /opt/omnitrace/lib/libomnitrace.so(+0xf2c8f2) [0x7fde93f2c8f2]
[PID=11012][TID=505][6/9] /opt/omnitrace/lib/libomnitrace.so(+0xf2e2f1) [0x7fde93f2e2f1]
[PID=11012][TID=505][7/9] /lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7fde95a94b43]
[PID=11012][TID=505][8/9] /lib/x86_64-linux-gnu/libc.so.6(+0x126a00) [0x7fde95b26a00]
----
lots of unreadable output
----
Backtrace (demangled):
[PID=11012][TID=505][0/9] __sigaction +0x50
[PID=11012][TID=505][1/9] OnUnload +0x3ea18
[PID=11012][TID=505][2/9] roctracer_open_pool +0xdbe
[PID=11012][TID=505][3/9] roctracer_open_pool +0x15f3
[PID=11012][TID=505][4/9] std::error_code::default_error_condition() const +0x33
[PID=11012][TID=505][5/9] std::pair<std::_Rb_tree_iterator<std::string >, bool> std::_Rb_tree<std::string, std::string, std::_Identity<std::string >, std::less<std::string >, std::allocator<std::string>>::_M_emplace_unique<char const (&) [14]>(char const (&) [14]) +0x3bd0f2
[PID=11012][TID=505][6/9] std::pair<std::_Rb_tree_iterator<std::string >, bool> std::_Rb_tree<std::string, std::string, std::_Identity<std::string >, std::less<std::string >, std::allocator<std::string>>::_M_emplace_unique<char const (&) [14]>(char const (&) [14]) +0x3beaf1
[PID=11012][TID=505][7/9] pthread_condattr_setpshared +0x513
[PID=11012][TID=505][8/9] __xmknodat +0x230

Backtrace (lineinfo):
[PID=11012][TID=505][0/7]
    [libc_sigaction.c:?] __restore_rt
[PID=11012][TID=505][1/7]
    [/opt/rocm-5.7.0/lib/libroctracer64.so.4.1.0:?] roctracer_open_pool
[PID=11012][TID=505][2/7]
    [/opt/rocm-5.7.0/lib/libroctracer64.so.4.1.0:?] roctracer_open_pool
[PID=11012][TID=505][3/7]
    [/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30:?] std::error_code::default_error_condition() const
[PID=11012][TID=505][4/7]
    [/opt/omnitrace/lib/libomnitrace.so.1.10.3:?] std::pair<std::_Rb_tree_iterator<std::string >, bool> std::_Rb_tree<std::string, std::string, std::_Identity<std::string >, std::less<std::string >, std::allocator<std::string>>::_M_emplace_unique<char const (&) [14]>(char const (&) [14])
[PID=11012][TID=505][5/7]
    [/opt/omnitrace/lib/libomnitrace.so.1.10.3:?] std::pair<std::_Rb_tree_iterator<std::string >, bool> std::_Rb_tree<std::string, std::string, std::_Identity<std::string >, std::less<std::string >, std::allocator<std::string>>::_M_emplace_unique<char const (&) [14]>(char const (&) [14])
[PID=11012][TID=505][6/7]
    [./nptl/./nptl/pthread_create.c:442] start_thread

[omnitrace][11012] Finalizing after signal 11 ::  Signal:    SIGSEGV (signal number:  11)                   segmentation violation

[omnitrace][11012][505][omnitrace_finalize] finalizing...
[omnitrace][11012][505][omnitrace_finalize]
[omnitrace][11012][505][omnitrace_finalize] omnitrace/process/11012 : 3.335331 sec wall_clock,  858.572 MB peak_rss,  879.178 MB page_rss, 3.510000 sec cpu_clock,  105.2 % cpu_util [laps: 1]
[omnitrace][11012][505][omnitrace_finalize] omnitrace/process/11012/thread/2 : 0.000043 sec wall_clock, 0.000043 sec thread_cpu_clock,  100.0 % thread_cpu_util,    0.256 MB peak_rss [laps: 1]
[omnitrace][11012][505][omnitrace_finalize]
[omnitrace][11012][505][omnitrace_finalize] Finalizing perfetto...
[omnitrace][11012][perfetto]> Outputting '/home/hongyi/mlc-llm/omnitrace-python3-output/2023-09-22_17.03/perfetto-trace-11012.proto' (9540.86 KB / 9.54 MB / 0.01 GB)... Done
[omnitrace][11012][metadata]> Outputting 'omnitrace-python3-output/2023-09-22_17.03/metadata-11012.json' and 'omnitrace-python3-output/2023-09-22_17.03/functions-11012.json'
[omnitrace][11012][505][omnitrace_finalize] Finalized: 0.119384 sec wall_clock,  125.852 MB peak_rss,   37.208 MB page_rss, 0.100000 sec cpu_clock,   83.8 % cpu_util

[omnitrace][11012] Killing process 11012 with signal 11...

[omnitrace][11012][0] Signal 11 caught :
[omnitrace][11012][505] Signal 11 caught : Segmentation fault (Signal sent by kill() [0x3eb00002b04])

Segmentation fault (Signal sent by kill() [0x3eb00002b04])
### ERROR ### [omnitrace][PID=11012][TID=505] signal=11 (SIGSEGV) segmentation violation. code: 0 (SI_USER :: Sent by kill(), pthread_kill(), raise(), abort() or alarm()), address of faulting memory reference: 0x3eb00002b04
Backtrace:
[PID=11012][TID=505][0/9] __sigaction +0x50
[PID=11012][TID=505][1/9] OnUnload +0x3ea18
[PID=11012][TID=505][2/9] roctracer_open_pool +0xdbe
[PID=11012][TID=505][3/9] roctracer_open_pool +0x15f3
[PID=11012][TID=505][4/9] _ZNKSt10error_code23default_error_conditionEv +0x33
[PID=11012][TID=505][5/9] _ZNSt8_Rb_treeINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES5_St9_IdentityIS5_ESt4lessIS5_ESaIS5_EE17_M_emplace_uniqueIJRA14_KcEEESt4pairISt17_Rb_tree_iteratorIS5_EbEDpOT_ +0x3bd0f2
[PID=11012][TID=505][6/9] _ZNSt8_Rb_treeINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES5_St9_IdentityIS5_ESt4lessIS5_ESaIS5_EE17_M_emplace_uniqueIJRA14_KcEEESt4pairISt17_Rb_tree_iteratorIS5_EbEDpOT_ +0x3beaf1
[PID=11012][TID=505][7/9] pthread_condattr_setpshared +0x513
[PID=11012][TID=505][8/9] __xmknodat +0x230

Backtrace (demangled):
[PID=11012][TID=505][0/9] /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7fde95a42520]
[PID=11012][TID=505][1/9] /opt/omnitrace/lib/libomnitrace.so(+0x854338) [0x7fde93854338]
[PID=11012][TID=505][2/9] /opt/rocm-5.7.0/lib/libroctracer64.so.4(+0x23c2e) [0x7fde95d81c2e]
[PID=11012][TID=505][3/9] /opt/rocm-5.7.0/lib/libroctracer64.so.4(+0x24463) [0x7fde95d82463]
[PID=11012][TID=505][4/9] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fde956dc253]
[PID=11012][TID=505][5/9] /opt/omnitrace/lib/libomnitrace.so(+0xf2c8f2) [0x7fde93f2c8f2]
[PID=11012][TID=505][6/9] /opt/omnitrace/lib/libomnitrace.so(+0xf2e2f1) [0x7fde93f2e2f1]
[PID=11012][TID=505][7/9] /lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7fde95a94b43]
[PID=11012][TID=505][8/9] /lib/x86_64-linux-gnu/libc.so.6(+0x126a00) [0x7fde95b26a00]
----
lots of unreadable output
----
Backtrace (demangled):
[PID=11012][TID=505][0/9] __sigaction +0x50
[PID=11012][TID=505][1/9] OnUnload +0x3ea18
[PID=11012][TID=505][2/9] roctracer_open_pool +0xdbe
[PID=11012][TID=505][3/9] roctracer_open_pool +0x15f3
[PID=11012][TID=505][4/9] std::error_code::default_error_condition() const +0x33
[PID=11012][TID=505][5/9] std::pair<std::_Rb_tree_iterator<std::string >, bool> std::_Rb_tree<std::string, std::string, std::_Identity<std::string >, std::less<std::string >, std::allocator<std::string>>::_M_emplace_unique<char const (&) [14]>(char const (&) [14]) +0x3bd0f2
[PID=11012][TID=505][6/9] std::pair<std::_Rb_tree_iterator<std::string >, bool> std::_Rb_tree<std::string, std::string, std::_Identity<std::string >, std::less<std::string >, std::allocator<std::string>>::_M_emplace_unique<char const (&) [14]>(char const (&) [14]) +0x3beaf1
[PID=11012][TID=505][7/9] pthread_condattr_setpshared +0x513
[PID=11012][TID=505][8/9] __xmknodat +0x230

Backtrace (lineinfo):
[PID=11012][TID=505][0/7]
    [libc_sigaction.c:?] __restore_rt
[PID=11012][TID=505][1/7]
    [/opt/rocm-5.7.0/lib/libroctracer64.so.4.1.0:?] roctracer_open_pool
[PID=11012][TID=505][2/7]
    [/opt/rocm-5.7.0/lib/libroctracer64.so.4.1.0:?] roctracer_open_pool
[PID=11012][TID=505][3/7]
    [/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30:?] std::error_code::default_error_condition() const
[PID=11012][TID=505][4/7]
    [/opt/omnitrace/lib/libomnitrace.so.1.10.3:?] std::pair<std::_Rb_tree_iterator<std::string >, bool> std::_Rb_tree<std::string, std::string, std::_Identity<std::string >, std::less<std::string >, std::allocator<std::string>>::_M_emplace_unique<char const (&) [14]>(char const (&) [14])
[PID=11012][TID=505][5/7]
    [/opt/omnitrace/lib/libomnitrace.so.1.10.3:?] std::pair<std::_Rb_tree_iterator<std::string >, bool> std::_Rb_tree<std::string, std::string, std::_Identity<std::string >, std::less<std::string >, std::allocator<std::string>>::_M_emplace_unique<char const (&) [14]>(char const (&) [14])
[PID=11012][TID=505][6/7]
    [./nptl/./nptl/pthread_create.c:442] start_thread

[omnitrace][11012] Finalizing after signal 11 ::  Signal:    SIGSEGV (signal number:  11)                   segmentation violation

I initially thought it might have the same cause as #304, but the error message and backtrace are different, so I'm not sure.

The command I use is omnitrace-sample python3 xxx.py
I'm using ROCm 5.7 on Ubuntu 22.04.

@jrmadsen
Collaborator

Can you set the environment variable OMNITRACE_VERBOSE=2, or run with omnitrace-sample -v 2 -- python3 xxx.py, and provide the log? Also, either set OMNITRACE_MONOCHROME=ON or pass --monochrome to get rid of the color characters when you post the backtrace.

I’m looking for where the error happens in the OnUnload function.
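For anyone collecting the same kind of log, here is a minimal Python sketch of the request above. It only assembles the command line and environment using the options named in this thread (`-v 2`, `--`, `OMNITRACE_MONOCHROME=ON`); the helper names `build_verbose_run` and `strip_ansi` are made up for illustration and are not part of Omnitrace. The second helper is a fallback for logs that were already captured with color codes, assuming they are standard ANSI SGR sequences.

```python
import os
import re

# Matches ANSI SGR color sequences such as ESC[01;32m and ESC[0m.
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")


def build_verbose_run(script, base_env=None):
    """Return (argv, env) for a verbose, monochrome omnitrace-sample run.

    Hypothetical helper: it only assembles the options quoted in this thread.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OMNITRACE_MONOCHROME"] = "ON"  # keep color codes out of the log
    argv = ["omnitrace-sample", "-v", "2", "--", "python3", script]
    return argv, env


def strip_ansi(text):
    """Remove color escape sequences from a log that was captured with colors."""
    return ANSI_SGR.sub("", text)
```

The returned pair can be handed to `subprocess.run(argv, env=env)` with stdout and stderr redirected to a file, which gives a log that is ready to attach.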

@jinhongyii
Author

log.txt

After the segfault, omnitrace repeatedly outputs

[omnitrace][23016][506][offload_buffer] Offloading 2048 samples for thread 3 to /tmp/omnitrace-python3-output/25407/sampling-23016.dat...

I don't know when it would stop, so I killed omnitrace.

@jinhongyii
Copy link
Author

Any update on this?

@jrmadsen
Collaborator

Is your library that is built against ROCm, /home/hongyi/mlc-llm/dist/Llama-2-7b-chat-hf-q4f16_1-tp2/Llama-2-7b-chat-hf-q4f16_1-rocm.so, built for ROCm 5.7?

@jinhongyii
Author

Yes. It's built on the same machine I profile on, and ROCm 5.7 is the only ROCm version installed there, so I think it is.

@jrmadsen
Collaborator

Given that it looks like a Python wheel and that ROCm 5.7 was released only recently, that might explain it. Omnitrace tends to encounter issues like this when there is a mismatch between the minor version of ROCm at runtime and the version it was built against.
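A rough way to look for such a mismatch in a posted log is sketched below in Python. It rests on two assumptions taken from the log in this issue rather than from any Omnitrace API: the banner reports the build-time version as `rocm: v5.7.x`, and runtime libraries are loaded from paths of the form `/opt/rocm-X.Y.Z/...`. All function names are hypothetical. It cannot tell which ROCm version a user library such as the `.so` mentioned above was built against.

```python
import re


def major_minor(version):
    """Extract (major, minor) from strings like 'v5.7.x' or '5.7.0'."""
    match = re.search(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"no version found in {version!r}")
    return int(match.group(1)), int(match.group(2))


def build_rocm_version(log_text):
    """ROCm version from the banner, e.g. 'rocm: v5.7.x' (format assumed from this log)."""
    match = re.search(r"rocm:\s*v?(\d+\.\d+)", log_text)
    return major_minor(match.group(1)) if match else None


def runtime_rocm_versions(log_text):
    """ROCm versions seen in library paths such as '/opt/rocm-5.7.0/lib/...'."""
    return {major_minor(v) for v in re.findall(r"/opt/rocm-(\d+\.\d+)", log_text)}


def has_minor_mismatch(log_text):
    """True/False when both versions are visible in the log, None otherwise."""
    built = build_rocm_version(log_text)
    runtime = runtime_rocm_versions(log_text)
    if built is None or not runtime:
        return None  # not enough information in the log
    return any(version != built for version in runtime)
```

A `None` result only means the log did not contain both pieces of information; it says nothing about whether the versions agree.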

@jinhongyii
Author

Thanks. I'll try again after you release Omnitrace pre-built for ROCm 5.7.

@jrmadsen
Collaborator

This was likely fixed in #309 and/or by using ROCm 5.7. Also, make sure you have OMNITRACE_CRITICAL_TRACE=OFF; there is still a data race there (critical tracing will be removed soon, as it was left incomplete after being superseded by causal profiling).
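One way to double-check that setting is to read it back from the `KEY=VALUE` lines that omnitrace-sample prints at the top of its output, as in the log posted in this issue. The Python sketch below does that; the function names are hypothetical, and the set of spellings treated as "disabled" is an assumption, since this thread only shows `false` and `OFF`.

```python
def parse_env_dump(log_text):
    """Collect KEY=VALUE lines such as the ones printed at the start of the log."""
    env = {}
    for line in log_text.splitlines():
        key, sep, value = line.strip().partition("=")
        # Keep only lines that look like an upper-case environment variable.
        if sep and key.isidentifier() and key.isupper():
            env[key] = value
    return env


# Assumption: these spellings all mean "disabled"; only "false" and "OFF" appear in the thread.
FALSY = {"off", "false", "0", "no"}


def critical_trace_disabled(env):
    """True only when OMNITRACE_CRITICAL_TRACE is explicitly set to a falsy value."""
    value = env.get("OMNITRACE_CRITICAL_TRACE")
    return value is not None and value.strip().lower() in FALSY
```

An unset variable returns `False` here on purpose, so the check errs toward asking the user to set it explicitly.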
