code_native broken on official macOS binaries #28046

Closed
distractedlambda opened this issue Jul 10, 2018 · 23 comments · Fixed by #30554
Labels
bug: Indicates an unexpected problem or unintended behavior
regression: Regression in behavior compared to a previous version

Comments

@distractedlambda

distractedlambda commented Jul 10, 2018

This issue stems from a discussion I started on the Julia Discourse forum.

In short, the code_native() introspection function seems to work inconsistently (and sometimes break completely) on both Linux and macOS, based on results from several macOS and Linux setups with several Julia versions. I'll be giving the versioninfo() for the two setups I used (macOS 10.13.5 on Apple hardware, Fedora 28 inside a Docker container), along with the results of several code_native() calls to demonstrate the issue. My macOS setup is running the just-released Julia 0.6.4 as of writing this issue, but I experienced the exact same behavior on macOS with Julia 0.6.3.

macOS versioninfo()

Julia Version 0.6.4
Commit 9d11f62bcb (2018-07-09 19:09 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=16)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

Fedora versioninfo()

Julia Version 0.6.3
Commit d55cadc (2018-05-28 20:20 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=128)
  LAPACK: libopenblasp.so.0
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

Fedora Behavior

As a very simple test case, let's dump the assembly for +(::Int, ::Int):

julia> code_native(+, (Int, Int), :intel)
WARNING: Could not determine size of symbol

That's strange. If we try wrapping it in another function:

julia> f(a, b) = a + b
f (generic function with 1 method)

julia> code_native(f, (Int, Int), :intel)
        .text
Filename: REPL[3]
        push    rbp
        mov     rbp, rsp
Source line: 1
        lea     rax, [rdi + rsi]
        pop     rbp
        ret
        nop     word ptr [rax + rax]

That works, but I see no reason why dumping code for + should fail. I also wonder if the WARNING being displayed in place of an assembly printout is text emitted from external code (i.e. not Julia), since I can't think of what the "Could not determine size of symbol" would mean in Julia.

macOS Behavior

Repeating the same initial test as before:

julia> code_native(+, (Int, Int), :intel)
        .section        __TEXT,__text,regular,pure_instructions
Filename: int.jl
        push    ebp
        dec     eax
        mov     ebp, esp
Source line: 32
        dec     eax
        lea     eax, [edi + esi]
        pop     ebp
        ret
Source line: 32
        nop
        nop
        nop
        nop
        nop
        nop

This time we get an assembly printout, but it's completely wrong. We're seeing only 32-bit instructions, though the actual code is of course 64-bit. Additionally, we've got needless dec eax instructions interspersed, which I suspect aren't actually in the generated code. We can try wrapping + in a function again, but we still get the broken ASM.
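A likely explanation for the stray dec eax lines, offered as a sketch rather than a diagnosis made in this thread: in 64-bit mode the byte 0x48 is a REX.W prefix that widens the following instruction to 64-bit operands, while in 32-bit mode the same byte decodes as the standalone instruction dec eax. The Python below is used only for illustration. The byte string is a hand-assembled assumption of what the compiler emitted for a + b, and decode is a toy that handles just these opcodes, not a general disassembler.

```python
# Assumed machine code for: push rbp; mov rbp, rsp; lea rax, [rdi + rsi]; pop rbp; ret
CODE = bytes.fromhex("55 48 89 e5 48 8d 04 37 5d c3")

REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode(code, bits):
    """Decode the few opcodes used in CODE as 32-bit or 64-bit x86."""
    assert bits in (32, 64)
    out, i, rex_w = [], 0, False
    while i < len(code):
        op = code[i]
        if op == 0x48:
            i += 1
            if bits == 64:
                rex_w = True         # REX.W prefix: next instruction is 64-bit
            else:
                out.append("dec eax")  # same byte is a real instruction here
            continue
        # Operand registers widen only under REX.W; stack and address
        # registers follow the mode's default width.
        operand = REG64 if (bits == 64 and rex_w) else REG32
        native = REG64 if bits == 64 else REG32
        if op in (0x55, 0x5D):       # push/pop of register 5 (rbp/ebp)
            out.append(("push " if op == 0x55 else "pop ") + native[5])
            i += 1
        elif op == 0x89:             # mov r/m, r (register-direct ModRM)
            modrm = code[i + 1]
            out.append(f"mov {operand[modrm & 7]}, {operand[(modrm >> 3) & 7]}")
            i += 2
        elif op == 0x8D:             # lea r, [base + index] via ModRM + SIB
            modrm, sib = code[i + 1], code[i + 2]
            dst = operand[(modrm >> 3) & 7]
            base, index = native[sib & 7], native[(sib >> 3) & 7]
            out.append(f"lea {dst}, [{base} + {index}]")
            i += 3
        elif op == 0xC3:
            out.append("ret")
            i += 1
        else:
            raise ValueError(f"unsupported opcode {op:#x}")
        rex_w = False                # a REX prefix applies to one instruction
    return out


if __name__ == "__main__":
    for mode in (64, 32):
        print(f"--- decoded as {mode}-bit ---")
        print("\n".join(decode(CODE, mode)))
```

Decoding with bits=64 gives the push rbp / mov rbp, rsp / lea rax, [rdi + rsi] sequence seen on Fedora; decoding the same bytes with bits=32 gives the push ebp / dec eax / mov ebp, esp pattern shown above. If that assumption about the bytes holds, the generated code is fine and only the disassembler's mode is wrong.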


To show that this problem is limited to code_native, we can try code_llvm, which always gives identical (and correct) output:

julia> code_llvm(+, (Int, Int))

define i64 @"jlsys_+_60291"(i64, i64) #0 !dbg !5 {
top:
  %2 = add i64 %1, %0
  ret i64 %2
}

While I used the + case as a minimal example, no code_native call I've tried has produced correct (i.e. 64-bit) assembly on my macOS setup. Another user (by the username Per in the discussion linked above) was able to get code_native working correctly with a source-built Julia 0.7.0-beta.214 on macOS High Sierra (10.13.x), but experienced the same macOS behavior I did with a downloaded 0.7.0-beta.0 and a downloaded 0.6.3 on macOS Sierra (10.12.x). Finally, with a locally-rebuilt system image in Julia 0.6.3 and macOS Sierra, Per was able to reproduce the WARNING output I experienced on Fedora.

@nalimilan
Member

I also wonder if the WARNING being displayed in place of an assembly printout is text emitted from external code (i.e. not Julia), since I can't think of what the "Could not determine size of symbol" would mean in Julia.

FWIW, it's printed by Julia:

jl_printf(JL_STDERR, "WARNING: Could not determine size of symbol\n");

@distractedlambda
Author

@nalimilan Interesting; can't believe I didn't think to just grep through the Julia sources for that message. But this does confirm a couple things for me:

  1. code_native works by disassembling in-memory binaries, rather than by making some separate LLVM call to generate assembly source or something (which makes sense).
  2. The "symbol" the WARNING refers to is not a Julia symbol, but rather a binary-code-level symbol.

Looking through the source file you linked, it appears that the warning refers to not being able to either look up or determine the size of the requested function. I'm somewhat confused as to why this is a WARNING instead of an ERROR (or better yet a Julia exception), since jl_dump_fptr_asm immediately returns after printing that message without doing its only job. Regardless, it's interesting to me that getting this message seems to (at least to some degree) be related to how the system image was built.
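To illustrate why a missing symbol size stops the disassembly, here is a toy model in Python. It is not Julia's implementation: memory, symbol_sizes, function_bytes and SymbolSizeUnknown are all hypothetical names. The point is only that a disassembler working on in-memory code needs a byte range, and that raising an exception makes the failure visible to the caller in a way a printed warning does not.

```python
class SymbolSizeUnknown(LookupError):
    """Raised when no size is recorded for the symbol at an offset."""


def function_bytes(memory, symbol_sizes, start):
    """Return the bytes of the function that begins at offset `start`.

    `memory` is a bytes object standing in for the loaded image, and
    `symbol_sizes` maps start offsets to sizes, playing the role of the
    symbol/debug information. Both are hypothetical stand-ins.
    """
    size = symbol_sizes.get(start)
    if not size:
        # Without a size there is no way to know where the function ends.
        raise SymbolSizeUnknown(
            f"could not determine size of symbol at {start:#x}"
        )
    return memory[start:start + size]
```

A caller can then catch SymbolSizeUnknown and decide what to do, instead of silently receiving no output.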

As for the issue being experienced on macOS, it looks like that would have to be occurring in jl_dump_asm_internal, but I would have barely any idea where to start with that. It occurred to me to check something basic like the return value of jl_get_llvm_disasm_target, but I need to get a Julia setup with debug symbols first...

@distractedlambda
Author

It just occurred to me to try disassembling +(::Int, ::Int) on Fedora with the prebuilt system image disabled, and sure enough, it works. So it looks like the WARNING issue isn't really specific to Linux, but rather to how the system image is built. Since a user-defined function isn't part of the system image, it disassembles just fine.

@distractedlambda
Author

distractedlambda commented Jul 11, 2018

Also, perhaps this deserves a separate issue, but on either of my setups, running julia -C help prints the machine and feature list 3 times, followed by:

ERROR: Julia and the system image were compiled for different architectures.
Please delete or regenerate sys.{so,dll,dylib}.

And if I try to set a CPU with a misspelled name, I get 7 identical error messages:

'hawell' is not a recognized processor for this target (ignoring processor)
'hawell' is not a recognized processor for this target (ignoring processor)
'hawell' is not a recognized processor for this target (ignoring processor)
'hawell' is not a recognized processor for this target (ignoring processor)
'hawell' is not a recognized processor for this target (ignoring processor)
'hawell' is not a recognized processor for this target (ignoring processor)
'hawell' is not a recognized processor for this target (ignoring processor)

@distractedlambda
Author

Ignore that; thought the "Close and comment" button was a "Close comment" button.

@yuyichao
Contributor

So it looks like the WARNING issue isn't really specific to Linux, but rather to how the system image is built

It usually means that the debug info is stripped or not installed.

And if I try to set a CPU with a misspelled name, I get 7 identical error messages:

Fixed on master. Not going to be fixed on 0.6.

@yuyichao
Contributor

running julia -C help prints the machine and feature list 3 times, followed by:

Also fixed on master and not backportable.

@distractedlambda
Author

@yuyichao Okay, that's good to know.

It usually means that the debug info is stripped or not installed.

So then does that mean that most of the official Linux binaries don't include debug symbols in the system image? I have both julia and julia-devel installed under a Fedora 28 VM, and I still get the WARNING if I try to disassemble a specialization from the system image.

@yuyichao
Contributor

So then does that mean that most of the official Linux binaries don't include debug symbols in the system image? I have both julia and julia-devel installed under a Fedora 28 VM

Judging from the package names, these are NOT official Linux binaries. The official ones are the ones you can download from julialang.org. The -devel package has nothing to do with debugging. If I read this correctly, you need the -debuginfo package.

@distractedlambda
Author

@yuyichao Okay, that actually makes a lot more sense then. The julia-devel package described itself as "Julia development, debugging and testing files", which gave me hope it might include debug symbols for the system image, somehow.

So that demystifies the WARNING issue, but still leaves the incorrect-disasm-on-macos issue.

@distractedlambda
Author

If I build Julia 0.6.4 from source on macOS, I get a system image with debug symbols and 64-bit assembly outputs. This solves the issue for me personally, since I'm okay with building from source, but it's strange that I get such different results from the official binary. I'm changing the name of the issue to reflect what it now seems to be.

distractedlambda changed the title from "code_native unstable/broken on Linux, macOS" to "code_native broken on official macOS binaries" Jul 14, 2018
JeffBezanson added the bug and regression labels Jul 23, 2018
@ikirill

ikirill commented Oct 22, 2018

This issue prevents me from using tools like llvm-mca (https://llvm.org/docs/CommandGuide/llvm-mca.html). Similar to Intel's iaca, llvm-mca is a tool that can compute throughput and latency for code such as a tight loop. It requires valid, meaningful assembly as input, which code_native isn't generating right now.
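As a stopgap before handing a listing to an analysis tool, one could at least detect the mis-decoded output. The sketch below is a heuristic written in Python for illustration: mentions_64bit_registers is a hypothetical helper, and the two sample listings are abbreviated from the Fedora and macOS output earlier in this thread. It will misjudge legitimate 64-bit code that happens to name no 64-bit register, so treat it as a sanity check and not as validation.

```python
import re

# 64-bit general-purpose register names: rax..rsp, rip, r8..r15 and
# their sub-registers such as r8d.
REG64_NAME = re.compile(
    r"\br(?:ax|bx|cx|dx|si|di|bp|sp|ip|(?:[89]|1[0-5])[dwb]?)\b"
)

FEDORA_LISTING = """\
        push    rbp
        mov     rbp, rsp
        lea     rax, [rdi + rsi]
        pop     rbp
        ret
"""

MACOS_LISTING = """\
        push    ebp
        dec     eax
        mov     ebp, esp
        dec     eax
        lea     eax, [edi + esi]
        pop     ebp
        ret
"""


def mentions_64bit_registers(listing):
    """Return True if any line of the listing names a 64-bit register."""
    return any(REG64_NAME.search(line) for line in listing.splitlines())
```

On a 64-bit build, a listing for which this returns False is a hint that the disassembler decoded the bytes in the wrong mode.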

@vchuravy
Member

@ikirill take a look at https://github.com/vchuravy/IACA.jl, I would be delighted to see a PR that adds support for llvm-mca

@ikirill

ikirill commented Oct 23, 2018

@vchuravy Is IACA.jl working now? I remember trying and failing to get it working a while ago. Basic support for llvm-mca is really easy: https://github.com/ikirill/LlvmMca.jl/blob/master/LlvmMca.jl#L21 (no LLVM.jl dependency or needing a source build or anything like that).

@vchuravy
Member

Yes, I fixed IACA.jl for 1.0; if you find anything that doesn't work, please report it. I use LLVM.jl because the output of code_native is not required to be stable. IACA.jl also no longer needs a source build; the README is just outdated.

@maleadt
Member

maleadt commented Oct 24, 2018

no LLVM.jl dependency or needing a source build or anything like that

Just FYI, LLVM.jl doesn't require a source build anymore.
Only builds with LLVM_SHLIB=0 (i.e. Arch Linux) are unsupported.

@maxbennedich
Contributor

code_native is still broken in Julia 1.0.1 on macOS (official binary), generating 32-bit / nonsense assembly:

julia> big_addition(x) = x + 0x123456789abcdef0
big_addition (generic function with 1 method)

julia> code_native(big_addition, (UInt64,); syntax=:intel)
	.section	__TEXT,__text,regular,pure_instructions
; Function big_addition {
; Location: REPL[1]:1
; Function +; {
; Location: REPL[1]:1
	dec	eax
	mov	eax, 2596069104
	js	0x58
	xor	al, 18
	dec	eax
	add	eax, edi
;}
	ret
	nop
;}

@eschnett
Contributor

This might be because the LLVM disassembler is using the wrong hardware architecture. Julia determines this setting by calling sys::getDefaultTargetTriple() in LLVM, which (I assume) might not necessarily be the same as the one the code generator is using. See line 631 in src/disasm.cpp.

If that's the case, the solution would be for the code generator to store the target information (if it doesn't already do so), and for the disassembler to use this, instead of asking LLVM to look at the host hardware and guess from there.

@vtjnash
Sponsor Member

vtjnash commented Dec 30, 2018

It's stored in the global variable jl_TargetMachine, set from llvm::sys::getProcessTriple.

@eschnett
Contributor

It sounds as if it would be worthwhile to try replacing the statements

std::string TripleName = sys::getDefaultTargetTriple();
Triple TheTriple(Triple::normalize(TripleName));

by

// getTargetTriple() returns an llvm::Triple, so take its string form explicitly
std::string TripleName = jl_TargetMachine->getTargetTriple().str();
Triple TheTriple(TripleName);

in disasm.cpp.

@maleadt
Member

maleadt commented Jan 1, 2019

This might be because the LLVM disassembler is using the wrong hardware architecture.

This sounds like an issue I had with macOS binaries: maleadt/LLVM.jl#122 (comment), where the default triple instead comes from the system on which libLLVM was built.

@vtjnash
Sponsor Member

vtjnash commented Jan 2, 2019

# source build:
(lldb) x/s $rsi
0x1035067f0: "x86_64-apple-darwin17.7.0"

# binaries:
(lldb) x/s $rsi
0x101c89705: "i386-apple-darwin14.5.0"

which is really just printing out the configuration:

~/julia$ grep -R LLVM_DEFAULT_TARGET_TRIPLE usr/include/llvm
usr/include/llvm/Config/llvm-config.h:#define LLVM_DEFAULT_TARGET_TRIPLE "x86_64-apple-darwin16.7.0"

So it seems the build script is configuring the distributed binaries incorrectly. Tracing back further through the log files (e.g. https://build.julialang.org/#/builders/1/builds/330/steps/4/logs/stdio), it appears to be a build issue: we are not setting the custom LLVM_HOST_TRIPLE variable at build configuration time (it maps to the standard --host flag in autotools that we usually set).
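The check above can be scripted. The following Python sketch (for illustration only) pulls LLVM_DEFAULT_TARGET_TRIPLE out of the text of llvm-config.h and maps a triple's architecture component to a pointer width. The function names and the two-entry ARCH_BITS table are assumptions made for this example; LLVM's own Triple class covers far more architectures.

```python
import re

# Only the two architectures that appear in this thread.
ARCH_BITS = {"i386": 32, "x86_64": 64}


def default_triple_from_header(header_text):
    """Return LLVM_DEFAULT_TARGET_TRIPLE from llvm-config.h text, or None."""
    match = re.search(
        r'^\s*#define\s+LLVM_DEFAULT_TARGET_TRIPLE\s+"([^"]+)"',
        header_text,
        re.MULTILINE,
    )
    return match.group(1) if match else None


def arch_bits(triple):
    """Pointer width implied by the architecture component of a triple."""
    arch = triple.split("-", 1)[0]
    return ARCH_BITS[arch]  # KeyError for architectures not in the table


def same_width(disassembler_triple, codegen_triple):
    """True if both triples imply the same pointer width."""
    return arch_bits(disassembler_triple) == arch_bits(codegen_triple)
```

With the two strings read out in lldb above, same_width("i386-apple-darwin14.5.0", "x86_64-apple-darwin17.7.0") is False, which is the mismatch that makes the disassembler decode 64-bit code as 32-bit.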

vtjnash added commits that referenced this issue on Jan 2, 2019 and Jan 3, 2019, with the message:

broken by their move to cmake causing a switch away from the standard --host/--build autoconf

fix #28046

KristofferC pushed commits carrying the same message that referenced this issue on Jan 11, 2019 (cherry picked from commit 041c214), Feb 4, 2019 (twice), Feb 11, 2019 (three times, one of them cherry picked from commit 041c214), and Apr 20, 2019.
@non-Jedi
Contributor

I'm still seeing this issue with Julia 1.1.1 on Linux with binaries built from source. As far as I can tell, the fix PR #30554 made it into 1.1.1.

KristofferC pushed a commit that referenced this issue Feb 20, 2020
broken by their move to cmake causing a switch away from the standard --host/--build autoconf

fix #28046