unknown machine / CPUID related OpenBLAS issue #18692

Closed
floswald opened this issue Sep 27, 2016 · 6 comments
Labels: needs more info (Clarification or a reproducible example is required)

Comments

@floswald

I'm still wrestling with Julia running up against a virtual memory limit imposed by an SGE scheduler. I built julia-0.5 with the fix suggested in this issue comment. I requested 16GB of virtual and 16GB of total memory from the scheduler. From watching top earlier, I know that this job uses 2GB of RES and 6482MB of VIRT memory on this machine, so it should fit comfortably within those limits?

signal (4): Illegal instruction
while loading no file, in expression starting on line 0
dgemm_oncopy at /share/apps/econ/acapp/git/julia/usr/bin/../lib/libopenblas64_.so (unknown line)
Allocations: 24608392 (Pool: 24606549; Big: 1843); GC: 50
Illegal instruction

What does this mean?

  • is setting #define REGION_PG_COUNT 8*4096 in src/gc-pages.c too restrictive now?
  • What is the recommended solution for cases like this now?
  • Is there anything I could tell my sysadmin to do so that this problem becomes less likely? Is it possible for julia to use swapfiles (I'm just forwarding this as a suggestion from my sysadmin, I know nothing about it)?
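To check the numbers above from inside the job itself, a minimal Python sketch could compare the process's own virtual and resident sizes against the address-space limit it is running under. It assumes a Linux node where `/proc/<pid>/status` is available, and it assumes the scheduler enforces its virtual-memory request as an `RLIMIT_AS` rlimit (the thread does not say how the SGE limit is applied). The helper names are hypothetical.

```python
import os
import resource


def parse_proc_status(status_text):
    """Return VmSize and VmRSS (in kB) from the text of /proc/<pid>/status.

    VmSize is what top reports as VIRT, VmRSS is what it reports as RES.
    """
    fields = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            # Values look like "  6637568 kB"
            fields[key] = int(value.split()[0])
    return fields


def address_space_limit():
    """Soft RLIMIT_AS in bytes, or None when the limit is unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_AS)
    return None if soft == resource.RLIM_INFINITY else soft


if __name__ == "__main__" and os.path.exists("/proc/self/status"):
    with open("/proc/self/status") as f:
        usage = parse_proc_status(f.read())
    print("VmSize (VIRT) kB:", usage.get("VmSize"))
    print("VmRSS  (RES)  kB:", usage.get("VmRSS"))
    print("RLIMIT_AS bytes :", address_space_limit())
```

Run inside a scheduled job, this shows whether the limit the process actually sees matches what was requested. If VIRT is well below the limit, as the figures above suggest, the crash is probably not a memory-limit problem.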

I now ran the unit tests and get the same behaviour.

[uctpfos@lum-7-14 julia]$ make test
    JULIA test/all

signal (4): Illegal instruction
while loading /share/apps/econ/acapp/git/julia/test/linalg/dense.jl, in expression starting on line 7
dgemm_oncopy at /share/apps/econ/acapp/git/julia/usr/bin/../lib/libopenblas64_.so (unknown line)
Allocations: 3522101 (Pool: 3521101; Big: 1000); GC: 4
        From worker 4:       * linalg/dense         Worker 4 terminated.
ERROR (unhandled task failure): EOFError: read end of file

signal (4): Illegal instruction
while loading /share/apps/econ/acapp/git/julia/test/linalg/bunchkaufman.jl, in expression starting on line 23
sgemm_oncopy at /share/apps/econ/acapp/git/julia/usr/bin/../lib/libopenblas64_.so (unknown line)
Allocations: 6470298 (Pool: 6469304; Big: 994); GC: 9
        From worker 9:       * linalg/bunchkaufman  Worker 9 terminated.
ERROR (unhandled task failure): EOFError: read end of file
ERROR: connect: connection refused (ECONNREFUSED)
 in yieldto(::Task, ::ANY) at ./event.jl:136
 in wait() at ./event.jl:169
 in wait(::Condition) at ./event.jl:27
 in stream_wait(::TCPSocket, ::Condition, ::Vararg{Condition,N}) at ./stream.jl:44
 in wait_connected(::TCPSocket) at ./stream.jl:265
 in connect at ./stream.jl:960 [inlined]
 in connect_to_worker(::String, ::Int16) at ./managers.jl:483
 in connect_w2w(::Int64, ::WorkerConfig) at ./managers.jl:446
 in connect(::Base.DefaultClusterManager, ::Int64, ::WorkerConfig) at ./managers.jl:380
 in connect_to_peer(::Base.DefaultClusterManager, ::Int64, ::WorkerConfig) at ./multi.jl:1479
 in (::Base.##637#639)() at ./task.jl:360
Error [connect: connection refused (ECONNREFUSED)] on 10 while connecting to peer 9. Exiting.
Worker 10 terminated.
ERROR (unhandled task failure): Version read failed. Connection closed by peer.
 in process_hdr(::TCPSocket, ::Bool) at ./multi.jl:1410
 in message_handler_loop(::TCPSocket, ::TCPSocket, ::Bool) at ./multi.jl:1299
 in process_tcp_streams(::TCPSocket, ::TCPSocket, ::Bool) at ./multi.jl:1276
 in (::Base.##618#619{TCPSocket,TCPSocket,Bool})() at ./event.jl:68

signal (4): Illegal instruction
while loading /share/apps/econ/acapp/git/julia/test/linalg/eigen.jl, in expression starting on line 19
sgemm_oncopy at /share/apps/econ/acapp/git/julia/usr/bin/../lib/libopenblas64_.so (unknown line)
Allocations: 7954392 (Pool: 7953282; Big: 1110); GC: 12
        From worker 8:       * linalg/eigen         Worker 8 terminated.
ERROR (unhandled task failure): EOFError: read end of file

signal (4): Illegal instruction
while loading /share/apps/econ/acapp/git/julia/test/linalg/special.jl, in expression starting on line 120
dgemm_oncopy at /share/apps/econ/acapp/git/julia/usr/bin/../lib/libopenblas64_.so (unknown line)
Allocations: 6878561 (Pool: 6877145; Big: 1416); GC: 10
        From worker 7:       * linalg/special       Worker 7 terminated.
ERROR (unhandled task failure): EOFError: read end of file

signal (4): Illegal instruction
while loading /share/apps/econ/acapp/git/julia/test/linalg/schur.jl, in expression starting on line 19
sgemm_oncopy at /share/apps/econ/acapp/git/julia/usr/bin/../lib/libopenblas64_.so (unknown line)
Allocations: 10374574 (Pool: 10373418; Big: 1156); GC: 16
        From worker 6:       * linalg/schur         Worker 6 terminated.
ERROR (unhandled task failure): EOFError: read end of file

signal (4): Illegal instruction
while loading /share/apps/econ/acapp/git/julia/test/linalg/qr.jl, in expression starting on line 23
sgemm_oncopy at /share/apps/econ/acapp/git/julia/usr/bin/../lib/libopenblas64_.so (unknown line)
Allocations: 11909644 (Pool: 11908372; Big: 1272); GC: 19
        From worker 3:       * linalg/qr            Worker 3 terminated.
ERROR (unhandled task failure): EOFError: read end of file

signal (4): Illegal instruction
while loading /share/apps/econ/acapp/git/julia/test/linalg/triangular.jl, in expression starting on line 17
dgemm_oncopy at /share/apps/econ/acapp/git/julia/usr/bin/../lib/libopenblas64_.so (unknown line)
Allocations: 10104476 (Pool: 10101115; Big: 3361); GC: 16
        From worker 2:       * linalg/triangular    Worker 2 terminated.
ERROR (unhandled task failure): EOFError: read end of file
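
The crash reports above all follow the same three-line pattern, so they can be tallied mechanically to see which symbol and library the signal comes from. A minimal Python sketch, assuming the log layout shown in this post (the function name and regular expression are illustrative, not part of Julia's tooling):

```python
import re
from collections import Counter

# Matches one crash report: the signal line, the "while loading" line,
# and the first backtrace frame ("<symbol> at <library path>").
CRASH_RE = re.compile(
    r"signal \((\d+)\): (.+)\n"
    r"while loading (.+?), in expression starting on line \d+\n"
    r"(\w+) at (\S+)"
)


def summarize_crashes(log_text):
    """Count crashes per (signal name, faulting symbol, library file name)."""
    counts = Counter()
    for _signo, signame, _source, symbol, lib in CRASH_RE.findall(log_text):
        counts[(signame, symbol, lib.rsplit("/", 1)[-1])] += 1
    return counts
```

If every entry in the summary points at the same library and the same family of kernels, that suggests a problem with one compiled routine rather than memory exhaustion.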
@vtjnash
Sponsor Member

vtjnash commented Sep 27, 2016

Is this a virtual machine? From the error, it appears that the CPUID may be lying about which instructions it supports and angering OpenBLAS.

vtjnash added the "needs more info" label Sep 27, 2016
vtjnash changed the title from "how to deal with v0.5 memory problems on system with restrictions" to "unknown machine / CPUID related OpenBLAS issue" Sep 27, 2016
@floswald
Author

This is not a virtual machine. It's a compute cluster running CentOS 5.5 with Rocks cluster management. So I was under the impression that this is a memory issue?
I compiled it with make MARCH=x86-64 after applying the patch suggested above.
I get similar signal (4): Illegal instruction errors if I install the current pre-compiled release binary and run Base.runtests().

@vtjnash
Sponsor Member

vtjnash commented Sep 27, 2016

What CPU / what's versioninfo()? SIGILL usually means the processor doesn't support some instruction it is supposed to support according to the CPUID it is declaring.
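
One way to see what the processor claims to support is to read the feature flags Linux reports in `/proc/cpuinfo`. A minimal Python sketch, assuming an x86 Linux machine (where the field is called `flags`); the function names and the list of flags to check are hypothetical, chosen only for illustration:

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags of the first CPU listed in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def missing_features(cpuinfo_text, wanted):
    """Return the wanted flags that the CPU does not advertise, sorted."""
    return sorted(set(wanted) - cpu_flags(cpuinfo_text))


# Illustrative selection, using Linux's flag names:
# pni = SSE3, cx16 = CMPXCHG16B, lahf_lm = LAHF/SAHF in 64-bit mode.
WANTED = ["sse2", "pni", "cx16", "lahf_lm"]
```

Calling `missing_features(open("/proc/cpuinfo").read(), WANTED)` on the affected node lists the flags that are absent. Note that this only shows what the CPU advertises. A SIGILL can still occur if a library uses an instruction without checking for it.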

@floswald
Author

This is issue #18701.

@vtjnash
Sponsor Member

vtjnash commented Sep 27, 2016

Did you delete your response? From the versioninfo result you posted, I think that particular AMD processor doesn't support certain very useful instructions that all Intel processors support, so this CPU isn't supported (see https://en.wikipedia.org/wiki/X86-64#Older_implementations). You might try filing an issue upstream with OpenBLAS to see whether the maintainer wants to add detection that disables the optimized kernels for this processor.
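
Until upstream detection exists, one thing that could be tried locally is overriding OpenBLAS's kernel selection with the `OPENBLAS_CORETYPE` environment variable. This is a rough sketch under several assumptions: it only has an effect when OpenBLAS was built with `DYNAMIC_ARCH`, the accepted core names depend on the OpenBLAS build, and it is untested whether any choice avoids the faulting kernels on this CPU. The helper names and the `julia` executable path are hypothetical.

```python
import os
import subprocess


def env_with_openblas_core(core_name, base_env=None):
    """Copy the environment and request a specific OpenBLAS core type."""
    env = dict(os.environ if base_env is None else base_env)
    env["OPENBLAS_CORETYPE"] = core_name
    return env


def run_julia_with_core(core_name, julia="julia"):
    """Run a small matrix multiply in Julia under the overridden core type.

    Returns the process exit code; a negative value on POSIX means the
    process was killed by a signal (for example -4 for SIGILL).
    """
    code = "A = rand(200, 200); A * A; println(\"ok\")"
    result = subprocess.run([julia, "-e", code], env=env_with_openblas_core(core_name))
    return result.returncode
```

For example, `run_julia_with_core("Prescott")` would try a core name other than the auto-detected one. The name used here is only a placeholder to show the call.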

@floswald
Author

Yes, I did delete it, sorry! Here it is again:

julia> versioninfo()
Julia Version 0.5.0
Commit ee8dca9* (2016-09-24 19:29 UTC)
Platform Info:
  System: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Opteron (tm) Processor 875
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT NO_AFFINITY OPTERON)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, k8-sse3)
