all qemu_x86_64 tests hang on Ubuntu 18.04 #15877
@marc-hb and @ceolin have stumbled onto the same thing. Seems to be due to changes in the host toolchain over the last few days. Currently updated F28, F29 and Clear all seem to be producing valid binaries, which I can then copy over to my 18.04 environment and run successfully with the qemu (host and SDK) there. |
Looks like a linkage issue. Here's the 32 bit boot stub (see xuk-stub32.c) at the beginning of a working binary (this is from a xuk.elf unit test, not a Zephyr binary per se):
Basically it sets a stack pointer to one of two hard-configured stacks depending on whether this is an initial (multiboot) startup or an auxiliary SMP CPU, and then enters cstart(). Here's the same entry code for the failing binaries:
There's 64 bytes of garbage prepended! Note that the "bc 00 50 00 00" sequence of the initial MOV is present at offset 0x100040 (though it's obviously being mis-disassembled here). No idea where that's coming from yet. The linker script quite clearly puts a ".xuk_stub32" segment first, and the object file looks correct... |
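The check described above (finding where the stub's first instruction really lands) can be sketched in a few lines of Python. This is only an illustration: the byte pattern is the "bc 00 50 00 00" sequence quoted in this comment, while the function names and the idea of passing in the raw bytes of the first loaded segment are assumptions, not anything from the Zephyr tree.

```python
# Encoding of the stub's first MOV, exactly as quoted in the comment above.
STUB_FIRST_INSN = bytes.fromhex("bc00500000")


def stub_entry_offset(segment, pattern=STUB_FIRST_INSN):
    """Return the offset of the first occurrence of pattern, or -1 if absent."""
    return segment.find(pattern)


def describe_stub(segment):
    """Summarise whether the stub code sits at the very start of the segment."""
    offset = stub_entry_offset(segment)
    if offset < 0:
        return "stub instruction not found"
    if offset == 0:
        return "stub starts at the segment base"
    return "stub is preceded by %d (0x%x) unexpected bytes" % (offset, offset)
```

A plain byte search can in principle match by coincidence inside the prepended data, so a non-zero offset is a hint to go and disassemble, not proof.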
BTW both gcc 7 and 8 are available on this distro:
Not sure about the linker implications though. |
Heh, contents of the garbage:
Absolutely no idea what this is, but that string tells me 100% it was inserted by the toolchain and not Zephyr's build code. :) |
Yeah, something is wonky on Ubuntu. The toolchain (binutils, not gcc per se) is emitting stuff into the linked binaries that Just Shouldn't Be There. The garbage bytes above are reported in the map file as the contents of the .interp, .gnu.version_d, .gnu.version, .gnu.version_r, .dynsym, .dynstr, .gnu.hash and .eh_frame sections. But the linker script does not reference these sections, and in fact the only linked object file (xuk-stub32.o) does not even contain these sections. I can write some hacky lines in the linker script to bin these into a garbage segment, and with that the code gets through its 32 bit bootstrapping. It fails on entry into 64 bit mode, I'm guessing, for exactly the same reason. That will probably do for now if I can finish up the workaround. But... I'm at a complete loss here. I don't know where these things are coming from, at all. At a guess... maybe it's an interaction with how Ubuntu has configured LTO in the compiler (that is, the linker needs to generate code, so it needs to be prepared to emit stuff like this, and isn't correctly turning it off or something). And yeah, I checked: -fno-lto does nothing. |
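For anyone who wants to check a linked stub for these sections without reading the map file, here is a rough Python sketch that walks the section header table of a little-endian ELF32 image and flags the names listed above. It uses the standard ELF spellings (.gnu.version_r, .gnu.hash). The function and constant names are made up for illustration, and it assumes the image still has its section headers.

```python
import struct

ELF32_HEADER = struct.Struct("<16sHHIIIIIHHHHHH")  # 52-byte ELF32 file header
ELF32_SECTION = struct.Struct("<10I")              # 40-byte ELF32 section header

# Section names reported in the map file in the comment above.
UNEXPECTED_IN_STUB = {
    ".interp", ".gnu.version_d", ".gnu.version", ".gnu.version_r",
    ".dynsym", ".dynstr", ".gnu.hash", ".eh_frame",
}


def elf32_section_names(image):
    """Return the section names of a little-endian ELF32 image, in table order."""
    fields = ELF32_HEADER.unpack_from(image, 0)
    ident = fields[0]
    if ident[:4] != b"\x7fELF" or ident[4] != 1 or ident[5] != 1:
        raise ValueError("not a little-endian ELF32 image")
    shoff, shentsize, shnum, shstrndx = fields[6], fields[11], fields[12], fields[13]
    if shnum == 0:
        return []
    headers = [ELF32_SECTION.unpack_from(image, shoff + i * shentsize)
               for i in range(shnum)]
    # sh_offset and sh_size of the section name string table
    str_off, str_size = headers[shstrndx][4], headers[shstrndx][5]
    strtab = image[str_off:str_off + str_size]
    names = []
    for header in headers:
        start = header[0]                    # sh_name: offset into strtab
        end = strtab.index(b"\0", start)     # names are NUL-terminated
        names.append(strtab[start:end].decode("ascii"))
    return names


def unexpected_sections(image):
    """List sections in the image that the stub link was not expected to contain."""
    return [n for n in elf32_section_names(image) if n in UNEXPECTED_IN_STUB]
```

Typical use would be `unexpected_sections(open(path, "rb").read())` on the stub ELF; an empty list means none of the sections named in this thread are present.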
@andyross have you tried downgrading binutils? In
|
Should be especially easy to downgrade considering Ubuntu's website still points at the previous version: |
Wrong site, correction: https://packages.ubuntu.com/bionic-updates/binutils |
I found the (very slightly) older packages there:
Yet I still get this:
Ubuntu binutils decoder ring:
file $(dpkg -L binutils)
dpkg -L binutils-x86-64-linux-gnu
dpkg -L libbinutils
|
Adding gcc-8 -static -m32 -ffreestanding -fno-pic -fno-asynchronous-unwind-tables -mno-sse -mno-red-zone -Wl,--build-id=none -nostdlib -nodefaultlibs -nostartfiles -T ../../arch/x86_64/core/xuk-stub32.ld ./q64/zephyr/arch/arch/x86_64/core/xuk-stub32.o -o ./q64/zephyr/arch/arch/x86_64/core/xuk-stub32.elf
|
Hilarious. So... that sorta fits within the LTO theory. Normally "-static" is just about library selection, and of course there are no libraries involved in this link (we don't even pull anything from libgcc). But... yeah, maybe it's become a flag to the linker to disable some LTO generation that otherwise is assumed to be needed? If nothing else, this is a cleaner workaround than mine. You want to roll the patch or should I do it? |
Please do; considering my limited understanding I'd much rather you keep the git blame on all this. I just gave this hack a very quick sanitycheck and that passed. It may also burn your house down (but it didn't seem to break Fedora):
--- a/arch/x86_64/core/CMakeLists.txt
+++ b/arch/x86_64/core/CMakeLists.txt
@@ -45,6 +45,7 @@ add_custom_command(
-o ${CMAKE_CURRENT_BINARY_DIR}/xuk-stub32.o
COMMAND ${CMAKE_C_COMPILER} -m32 ${X86_64_BASE_CFLAGS}
-Wl,--build-id=none -nostdlib -nodefaultlibs -nostartfiles
+ -static
-T ${CMAKE_CURRENT_SOURCE_DIR}/xuk-stub32.ld
${CMAKE_CURRENT_BINARY_DIR}/xuk-stub32.o
-o ${CMAKE_CURRENT_BINARY_DIR}/xuk-stub32.elf
Any more advanced theory as to why Fedora doesn't need it? LTO off by default or something? |
I was focusing on the linker and didn't see these major gcc upgrades at the same time:
|
After stracing
Now this still doesn't seem to explain the very recent regression because Ubuntu says they've been turning on PS: no LTO options difference observed between Ubuntu and Fedora. |
Ah... that's the magic. I love -no-pie because it's clearly correct per the docs. We are not, in fact, generating a position independent executable. So no need to call out a hack or workaround. And nice work digging out the underlying ld command line -- the only real platform difference is that Ubuntu started (mid-LTS, sigh...) to link PIE by default, I guess? But this remains a linker bug, IMHO. I mean, if the linker wants to emit a few PLT relocation entries for the entry point or whatever, that's fine. They go in their own section. The behavior that killed us wasn't that it was emitting useless junk, but that it was including it in the link. We had our own linker script! It's supposed to include what we ask for! And it's not like it's special behavior for the "default" segment or something, it's a custom-defined segment of ours named "stub32". Likewise it's not like it's special to the default ELF output, we're defining our own PHDR! Ugh. But at least it's understood now. Thanks. |
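A quick way to tell whether a given link came out as PIE is to look at the `e_type` field of the ELF header: PIE output is marked `ET_DYN` (3), while a fixed-address executable is `ET_EXEC` (2). The Python sketch below reads just that field. The function names are invented for illustration, and since shared libraries are also `ET_DYN`, treat the result as a heuristic for executables only.

```python
import struct

ET_EXEC = 2  # fixed-address executable
ET_DYN = 3   # shared object, which is also how PIE executables are marked


def elf_type(image):
    """Return the e_type field from the header of an ELF image."""
    if image[:4] != b"\x7fELF":
        raise ValueError("not an ELF image")
    # e_ident[5] is EI_DATA: 1 means little-endian, 2 means big-endian
    endian = "<" if image[5] == 1 else ">"
    (e_type,) = struct.unpack_from(endian + "H", image, 16)
    return e_type


def looks_like_pie(image):
    """True when the image is marked ET_DYN, as a PIE link would be."""
    return elf_type(image) == ET_DYN
```

Running this on the stub built with and without -no-pie should show the difference directly, assuming the default-PIE explanation above is the whole story.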
Within the past few days, an update to the Ubuntu 18.04 toolchain has begun emitting code sections during link that are messing with our stub generation. They are appearing in the 32 bit stub link despite not being defined in the single object file, and (worse) being included in the output segment (i.e. at the start of the bootloader entry point!) despite not being specifically included by the linker script. I don't understand this behavior at all, and it appears to be directly contrary to the way the linker is documented. Marc Herbert discovered this was down to gcc being called with --enable-default-pie, so -no-pie works to suppress this behavior and restore the default. And it's correct: we aren't actually generating a position independent executable, even if we don't understand why the linker script is being disregarded (to include sections we don't include). See discussion in the linked github issue. Fixes zephyrproject-rtos#15877 Signed-off-by: Andy Ross <[email protected]>
I downgraded all the packages below and went back in Zephyr history (to commit 154c20c) yet the issue (and the fix) reproduce exactly the same, no difference whatsoever. So while everyone is happy with the fix, no one knows why and how the regression happened.
|
Heh, your persistence is amazing. I mean, personally I'm fine with a level of understanding that stops at "distro toolchains are voodoo" and with a final fix that amounts to "put our own toolchain into the SDK where we can control the configuration". But... I dunno. Maybe it's possible this has never worked on 18.04 and we just plain never noticed? I mean... I'm absolutely sure this was running on gale over thousands of runs a few weeks back. But... I know I was doing some mix of native builds and ones in a Docker built to emulate the CI environment (so in this case using the 16.04 host compiler), and I know there were a few other bugs fixed in the Ubuntu toolchains. So maybe after fixing the final 16.04 bug I never went back and ran it in the host natively? It's... possible. I mean, if I had to put money down, this feels like a regression. You, Andrew and Flavio all reported the same failure within a day of each other. But... I don't know that I'd give that scenario better than 5:1 odds. |
"was" :-) |
The last message, in all cases, is
My environment: