
julia-rc3 & rc4 binary segfault with enough processes #18477

Closed
mauro3 opened this issue Sep 13, 2016 · 16 comments

@mauro3
Contributor

mauro3 commented Sep 13, 2016

On a linux machine with AMD processor the binaries of 0.5-rc3 & rc4 both segfault on startup:

~/julia/julia-0.5-rc4-bin/julia-9c76c3e89a/bin $ ./julia
zsh: segmentation fault (core dumped)  ./julia

~/julia/julia-0.5-rc4-bin/julia-9c76c3e89a/bin $ uname -a
Linux ... 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1 (2016-07-04) x86_64 GNU/Linux

~/julia/julia-0.5-rc4-bin/julia-9c76c3e89a/bin $ cat /proc/cpuinfo 
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 2
model name      : Quad-Core AMD Opteron(tm) Processor 2356
stepping        : 3
microcode       : 0x1000095
cpu MHz         : 2300.000
cache size      : 512 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1
gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm ss
e4a misalignsse 3dnowprefetch osvw ibs hw_pstate npt lbrv svm_lock vmmcall
bogomips        : 4600.53
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

On a different machine with an Intel CPU (sharing the file system), the same binary starts and the tests pass in serial (but not in parallel; that is probably a different issue). (Edit: actually not true, see below.)

Also, I'm currently building from source, but there too only the serial tests run. (Note: for the 0.4 build I had to #10390 (comment); other than that, 0.4 worked fine.)

@tkelman
Contributor

tkelman commented Sep 13, 2016

Can you try to get a backtrace in gdb? And try under julia-debug, to see if that gives different behavior or a more informative backtrace.

@tkelman tkelman added this to the 0.5.0 milestone Sep 13, 2016
@mauro3
Contributor Author

mauro3 commented Sep 13, 2016

Turns out the segfault only happens when another julia instance is already running with several worker processes. Running julia -p13 in one shell will start the REPL, but then running another ./julia in a second shell produces the segfault. With -p12, the second julia runs without segfaulting. The backtrace is:

~/julia/julia-0.5-rc4-bin/julia-9c76c3e89a/bin $ gdb julia-debug 
GNU gdb (Debian 7.7.1+dfsg-5) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from julia-debug...done.
(gdb) run
Starting program: /home/werderm/julia/julia-0.5-rc4-bin/julia-9c76c3e89a/bin/julia-debug 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff4605700 (LWP 130519)]

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff777fa23 in jl_gc_alloc_page () at /home/centos/buildbot/slave/package_tarball64/build/src/gc-pages.c:125
125     /home/centos/buildbot/slave/package_tarball64/build/src/gc-pages.c: No such file or directory.
(gdb) bt
#0  0x00007ffff777fa23 in jl_gc_alloc_page () at /home/centos/buildbot/slave/package_tarball64/build/src/gc-pages.c:125
#1  0x00007ffff777a734 in add_page (p=0x7ffff7f9d7a0) at /home/centos/buildbot/slave/package_tarball64/build/src/gc.c:800
#2  0x00007ffff777aa64 in jl_gc_pool_alloc (ptls=0x7ffff7f9d190, pool_offset=1552, osize=96) at /home/centos/buildbot/slave/package_tarball64/build/src/gc.c:868
#3  0x00007ffff7738669 in jl_gc_alloc_ (ptls=0x7ffff7f9d190, sz=88, ty=0x0) at /home/centos/buildbot/slave/package_tarball64/build/src/julia_internal.h:148
#4  0x00007ffff773bacb in jl_new_uninitialized_datatype () at /home/centos/buildbot/slave/package_tarball64/build/src/alloc.c:850
#5  0x00007ffff7713f80 in jl_init_types () at /home/centos/buildbot/slave/package_tarball64/build/src/jltypes.c:3496
#6  0x00007ffff77419f0 in _julia_init (rel=JL_IMAGE_JULIA_HOME) at /home/centos/buildbot/slave/package_tarball64/build/src/init.c:645
#7  0x00007ffff7743d0a in julia_init (rel=JL_IMAGE_JULIA_HOME) at /home/centos/buildbot/slave/package_tarball64/build/src/task.c:283
#8  0x0000000000401e86 in main (argc=0, argv=0x7fffffffe460) at /home/centos/buildbot/slave/package_tarball64/build/ui/repl.c:231
(gdb)

Also, running julia -p14 (i.e. with one more process) hangs for some time, then prints:

~/julia/julia-0.5-rc4-bin/julia-9c76c3e89a/bin $ ./julia -p14
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.

then hangs (for at least tens of seconds); hitting ctrl-C produced:

^CERROR: connect: connection refused (ECONNREFUSED)
 in yieldto(::Task, ::ANY) at ./event.jl:136
 in wait() at ./event.jl:169
 in wait(::Condition) at ./event.jl:27
 in stream_wait(::TCPSocket, ::Condition, ::Vararg{Condition,N}) at ./stream.jl:44
 in wait_connected(::TCPSocket) at ./stream.jl:265
 in connect at ./stream.jl:960 [inlined]
 in connect_to_worker(::SubString{String}, ::Int16) at ./managers.jl:483
 in connect(::Base.LocalManager, ::Int64, ::WorkerConfig) at ./managers.jl:425
 in create_worker(::Base.LocalManager, ::WorkerConfig) at ./multi.jl:1786
 in setup_launched_worker(::Base.LocalManager, ::WorkerConfig, ::Array{Int64,1}) at ./multi.jl:1733
 in (::Base.##649#653{Base.LocalManager,Array{Int64,1}})() at ./task.jl:360

...and 13 other exceptions.

 in sync_end() at ./task.jl:311
 in macro expansion at ./task.jl:327 [inlined]
 in #addprocs_locked#645(::Array{Any,1}, ::Function, ::Base.LocalManager) at ./multi.jl:1688
 in #addprocs#644(::Array{Any,1}, ::Function, ::Base.LocalManager) at ./multi.jl:1658
 in #addprocs#748(::Bool, ::Array{Any,1}, ::Function, ::Int32) at ./managers.jl:306
 in process_options(::Base.JLOptions) at ./client.jl:224
 in _start() at ./client.jl:318
UndefRefError()

Going through the same procedure (waiting, then hitting ctrl-c) with gdb produces:

werderm@vierzack03 ~/julia/julia-0.5-rc4-bin/julia-9c76c3e89a/bin  
 % gdb --args julia-debug -p14
GNU gdb (Debian 7.7.1+dfsg-5) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from julia-debug...done.
(gdb) run
Starting program: /home/werderm/julia/julia-0.5-rc4-bin/julia-9c76c3e89a/bin/julia-debug -p14
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff4605700 (LWP 131221)]
[New Thread 0x7ffdec421700 (LWP 131222)]
[New Thread 0x7ffdebc20700 (LWP 131223)]
[New Thread 0x7ffde941f700 (LWP 131224)]
[New Thread 0x7ffde6c1e700 (LWP 131225)]
[New Thread 0x7ffde441d700 (LWP 131226)]
[New Thread 0x7ffde1c1c700 (LWP 131227)]
[New Thread 0x7ffddf41b700 (LWP 131228)]
[New Thread 0x7ffddcc1a700 (LWP 131229)]
[New Thread 0x7ffdda419700 (LWP 131230)]
[New Thread 0x7ffdd7c18700 (LWP 131231)]
[New Thread 0x7ffdd5417700 (LWP 131232)]
[New Thread 0x7ffdd2c16700 (LWP 131233)]
[New Thread 0x7ffdd0415700 (LWP 131234)]
[New Thread 0x7ffdcdc14700 (LWP 131235)]
[New Thread 0x7ffdcb413700 (LWP 131236)]
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
^C
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff777a8aa in jl_gc_pool_alloc (ptls=0x7ffff7f9d190, pool_offset=1432, osize=16)
    at /home/centos/buildbot/slave/package_tarball64/build/src/gc.c:832
832     /home/centos/buildbot/slave/package_tarball64/build/src/gc.c: No such file or directory.
(gdb) bt
#0  0x00007ffff777a8aa in jl_gc_pool_alloc (ptls=0x7ffff7f9d190, pool_offset=1432, osize=16)
    at /home/centos/buildbot/slave/package_tarball64/build/src/gc.c:832
#1  0x00007ffff1d89961 in jlcall_unsafe_convert_3787 () from /home/werderm/julia/julia-0.5-rc4-bin/julia-9c76c3e89a/lib/julia/sys-debug.so
#2  0x00007ffff7717f64 in jl_call_method_internal (meth=0x7ffdf0636f50, args=0x7fffffffcee8, nargs=3)
    at /home/centos/buildbot/slave/package_tarball64/build/src/julia_internal.h:189
#3  0x00007ffff771e969 in jl_apply_generic (args=0x7fffffffcee8, nargs=3) at /home/centos/buildbot/slave/package_tarball64/build/src/gf.c:1929
#4  0x00007ffff1d86e59 in julia_exec_3745 (re=38043760, subject=..., offset=0, options=1073741824, match_data=35495344) at pcre.jl:130
#5  0x00007ffff1d87627 in julia_match_3742 (re=..., str=..., idx=1, add_opts=0) at regex.jl:161
#6  0x00007ffdc39651ce in ?? ()
#7  0x0000000000000024 in ?? ()
#8  0x00007fffffffd2e0 in ?? ()
#9  0x0000000000000000 in ?? ()
(gdb) 

@tkelman tkelman modified the milestones: 0.5.x, 0.5.0 Sep 13, 2016
@mauro3
Contributor Author

mauro3 commented Sep 13, 2016

This actually happens on the Intel machine after all too, just with a different number of procs needed to trigger it.

Another observation (this one seems to happen only on the Intel machine): I run julia-0.4 (which runs fine as far as I can tell) with julia -p16. Then trying to start julia-0.5 segfaults, while starting another julia-0.4 does not.

The IT support for those machines is very helpful, so let me know if I should ask them anything.

@mauro3 mauro3 changed the title julia-rc3 & rc4 binary segfault on AMD julia-rc3 & rc4 binary segfault with enough processes Sep 13, 2016
@yuyichao
Contributor

This basically means you are running too many processes for the address space quota.

@tkelman
Contributor

tkelman commented Sep 13, 2016

Any way to make the failure less alarming, and more descriptive about what the problem was?

@yuyichao
Contributor

No. But you can try

diff --git a/src/gc-pages.c b/src/gc-pages.c
index fbe2b27..90be548 100644
--- a/src/gc-pages.c
+++ b/src/gc-pages.c
@@ -45,6 +45,7 @@ void jl_gc_init_page(void)
 // Return `NULL` if allocation failed. Result is aligned to `GC_PAGE_SZ`.
 static char *jl_gc_try_alloc_region(int pg_cnt)
 {
+    jl_safe_printf("Try allocating %d pages\n", pg_cnt);
     const size_t pages_sz = sizeof(jl_gc_page_t) * pg_cnt;
     const size_t freemap_sz = sizeof(uint32_t) * pg_cnt / 32;
     const size_t meta_sz = sizeof(jl_gc_pagemeta_t) * pg_cnt;

to see how much it's trying to allocate.

@mauro3
Contributor Author

mauro3 commented Sep 13, 2016

But then why is this a problem with 0.5-rc4 but not with 0.4.6? Or at least the problem on 0.4.6 occurs much later.

@yuyichao
Contributor

Because on 0.5 you are not applying the patch that pessimistically uses a tiny region size.

@mauro3
Contributor Author

mauro3 commented Sep 13, 2016

Ah, my understanding was that this was fixed with #16385.

@yuyichao
Contributor

It still needs to pick a size, and if you have a total quota shared between multiple processes, it would have to be very pessimistic in order not to hit the limit.

@mauro3
Contributor Author

mauro3 commented Sep 13, 2016

For 0.4 I used #define REGION_PG_COUNT 8*4096, i.e. 16x smaller than the default. I'll set that, build from source, and report back. Tnx.

@mauro3
Contributor Author

mauro3 commented Sep 13, 2016

Cool, changing #define REGION_PG_COUNT 16*8*4096 to #define REGION_PG_COUNT 8*4096 in src/gc-pages.c (was gc.c in 0.4) works.

However, is this a new issue, or is #10390 not resolved? Also, the error is now different than before: running the addprocs example now results in Master process (id 1) could not connect within 60.0 seconds. errors (see above), then hangs.

@yuyichao
Contributor

Dup of #17987

@tkelman
Contributor

tkelman commented Sep 13, 2016

No

Can you please explain why not? At least with the older error it was obvious what was happening.

@yuyichao
Contributor

All the related allocations are checked afaict, and if it still segfaults, there's nothing we can do to figure out why.

@tkelman
Contributor

tkelman commented Sep 14, 2016

Is the hang when there are enough processes an error handling issue in the parallel code, then?
