Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

all qemu_x86_64 tests hang on Ubuntu 18.04 #15877

Closed
andrewboie opened this issue May 3, 2019 · 18 comments · Fixed by #15891
Closed

all qemu_x86_64 tests hang on Ubuntu 18.04 #15877

andrewboie opened this issue May 3, 2019 · 18 comments · Fixed by #15891
Assignees
Labels
bug The issue is a bug, or the PR is fixing a bug priority: high High impact/importance bug

Comments

@andrewboie
Copy link
Contributor

The last message, in all cases, is

Booting from ROM..

My environment:

Linux yggdrasil 4.15.0-48-generic #51-Ubuntu SMP Wed Apr 3 08:28:49 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
gcc (Ubuntu 7.4.0-1ubuntu1~18.04) 7.4.0

@andrewboie andrewboie added bug The issue is a bug, or the PR is fixing a bug priority: high High impact/importance bug labels May 3, 2019
@andyross
Copy link
Contributor

andyross commented May 3, 2019

@marc-hb and @ceolin have stumbled onto the same thing. Seems to be due to changes in the host toolchain over the last few days. Currently updated F28, F29 and Clear all seem to be producing valid binaries, which I can then copy over to my 18.04 environment and run successfully with the qemu (host and SDK) there.

@andyross
Copy link
Contributor

andyross commented May 3, 2019

Looks like a linkage issue. Here's the 32 bit boot stub (see xuk-stub32.c) at the beginning of a working binary (this is from a xuk.elf unit test, not a Zephyr binary per se):

Disassembly of section .text:

00100000 <__bss_end-0x1d2c>:
  100000:       bc 00 50 00 00          mov    $0x5000,%esp
  100005:       31 d2                   xor    %edx,%edx
  100007:       3d f7 8d f0 aa          cmp    $0xaaf08df7,%eax
  10000c:       0f 44 a2 00 40 00 00    cmove  0x4000(%edx),%esp
  100013:       53                      push   %ebx
  100014:       50                      push   %eax
  100015:       e8 bc 03 00 00          call   1003d6 <__bss_end-0x1956>

Basically it sets a stack pointer to one of two hard-configured stacks depending on whether this is a initial (multiboot) startup or an auxilliary SMP CPU, and then enters cstart().

Here's the same entry code for the failing binaries:

00100000 <.text>:
  100000:       2f                      das    
  100001:       6c                      insb   (%dx),%es:(%edi)
  100002:       69 62 2f 6c 64 2d 6c    imul   $0x6c2d646c,0x2f(%edx),%esp
  100009:       69 6e 75 78 2e 73 6f    imul   $0x6f732e78,0x75(%esi),%ebp
  100010:       2e 32 00                xor    %cs:(%eax),%al
        ...
  100027:       00 01                   add    %al,(%ecx)
  100029:       00 00                   add    %al,(%eax)
  10002b:       00 01                   add    %al,(%ecx)
  10002d:       00 00                   add    %al,(%eax)
  10002f:       00 01                   add    %al,(%ecx)
        ...
  10003d:       00 00                   add    %al,(%eax)
  10003f:       00 bc 00 50 00 00 31    add    %bh,0x31000050(%eax,%eax,1)
  100046:       d2 3d f7 8d f0 aa       sarb   %cl,0xaaf08df7

There's 64 bytes of garbage prepended! Note that the "bc 00 50 00 00" sequence of the initial MOV is present at offset 0x100040 (though it's obviously being mis-disassembled here).

No idea where that's coming from yet. The linker script quite clearly puts a ".xuk_stub32" segment first, and the object file looks correct...

@marc-hb
Copy link
Collaborator

marc-hb commented May 3, 2019

BTW both gcc 7 and 8 are available on this distro:

apt-get install gcc-8

<some once-off update-alternatives magic, Google for details>

update-alternatives --config gcc

There are 2 choices for the alternative gcc (providing /usr/bin/gcc).

  Selection    Path            Priority   Status
------------------------------------------------------------
  0            /usr/bin/gcc-8   800       auto mode
  1            /usr/bin/gcc-7   700       manual mode
* 2            /usr/bin/gcc-8   800       manual mode

Not sure about the linker implications though.

@andyross
Copy link
Contributor

andyross commented May 3, 2019

Heh, contents of the garbage:

00001000  2f 6c 69 62 2f 6c 64 2d  6c 69 6e 75 78 2e 73 6f  |/lib/ld-linux.so|
00001010  2e 32 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |.2..............|
00001020  00 00 00 00 00 00 00 00  01 00 00 00 01 00 00 00  |................|
00001030  01 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

Absolutely no idea what this is, but that string tells me 100% it was inserted by the toolchain and not Zephyr's build code. :)

@andyross
Copy link
Contributor

andyross commented May 3, 2019

Yeah, something is wonky on Ubuntu. The toolchain (binutils, not gcc per se) is emitting stuff into the linked binaries that Just Shouldn't Be There.

The garbage bytes above are reported in the map file as the contents of the .interp, .gnu.version_d, .gnu.version, .gnu_version_r, .dynsym, .dynstr, .gnu_hash and .eh_frame sections. But the linker script does not reference these seconds, and in fact the only linked object file (xuk-stub32.o) does not even contain these sections.

I can write some hacky lines in the linker script to bin these into a garbage segment, and with that the code gets through its 32 bit bootstrapping. It fails on entry into 64 bit mode, I'm guessing, for exactly the same reason.

That will probably do for now if I can finish up the workaround. But... I'm at a complete loss here. I don't know where these things are coming from, at all. At a guess... maybe it's an interaction with how Ubuntu has configured LTO in teh compiler (that is, the linker needs to generate code, so it needs to be prepared to emit stuff like this, and isn't correctly turning it off or something). And yeah, I checked: -fno-lto does nothing.

@marc-hb
Copy link
Collaborator

marc-hb commented May 3, 2019

@andyross have you tried downgrading binutils? In /var/log/apt/history.log on your system:

Start-Date: 2019-05-02  14:16:48
Commandline: apt-get upgrade
Upgrade: binutils:amd64 (2.30-21ubuntu1~18.04, 2.30-21ubuntu1~18.04.1), 
 binutils-x86-64-linux-gnu:amd64 (2.30-21ubuntu1~18.04, 2.30-21ubuntu1~18.04.1)
    etc.

@marc-hb
Copy link
Collaborator

marc-hb commented May 3, 2019

Should be especially easy to downgrade considering Ubuntu websites' still point at the previous version:
https://packages.ubuntu.com/bionic/binutils

@marc-hb
Copy link
Collaborator

marc-hb commented May 3, 2019

considering Ubuntu websites' still point at the previous version: https://packages.ubuntu.com/bionic/binutils

Wrong site, correction: https://packages.ubuntu.com/bionic-updates/binutils
So now we have a changelog: http://changelogs.ubuntu.com/changelogs/pool/main/b/binutils/binutils_2.30-21ubuntu1~18.04.1/changelog
... which looks like a no-op. The mystery deepens.

@marc-hb
Copy link
Collaborator

marc-hb commented May 3, 2019

I found the (very slightly) older packages there:
https://launchpad.net/ubuntu/bionic/amd64/binutils-x86-64-linux-gnu/2.30-21ubuntu1~18.04/
I downgraded them all:

dpkg -l *binutils*

ii  binutils                              2.30-21ubuntu1~18.04                        amd64        GNU assembler, linker and binary utilities
ii  binutils-common:amd64                 2.30-21ubuntu1~18.04                        amd64        Common files for the GNU assembler, linker and binary 
ii  binutils-multiarch                    2.30-21ubuntu1~18.04                        amd64        Binary utilities that support multi-arch targets
ii  binutils-x86-64-linux-gnu             2.30-21ubuntu1~18.04                        amd64        GNU binary utilities, for x86-64-linux-gnu target
ii  libbinutils:amd64                     2.30-21ubuntu1~18.04                        amd64        GNU binary utilities (private shared library)

Yet I still get this:

objdump -h ./q64/zephyr/arch/arch/x86_64/core/xuk-stub32.elf

./q64/zephyr/arch/arch/x86_64/core/xuk-stub32.elf:     file format elf32-i386

Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .interp       00000013  00100000  00100000  00001000  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  1 .dynsym       00000010  00100014  00100014  00001014  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  2 .dynstr       00000001  00100024  00100024  00001024  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  3 .gnu.hash     00000018  00100028  00100028  00001028  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  4 .stub32       0000026c  00100040  00100040  00001040  2**2
                  CONTENTS, ALLOC, LOAD, CODE
  5 .rel.dyn      00000008  001002ac  001002ac  000012ac  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  6 .dynamic      00000088  001002c0  001002c0  000012c0  2**2
                  CONTENTS, ALLOC, LOAD, DATA
  7 .comment      00000029  00000000  00000000  00001348  2**0
                  CONTENTS, READONLY

Ubuntu binutils decoder ring:

file $(dpkg -L binutils)

/usr/bin/ld:                                 symbolic link to x86_64-linux-gnu-ld
/usr/bin/ld.bfd:                             symbolic link to x86_64-linux-gnu-ld.bfd
/usr/bin/ld.gold:                            symbolic link to x86_64-linux-gnu-ld.gold
/usr/bin/objcopy:                            symbolic link to x86_64-linux-gnu-objcopy
/usr/bin/strip:                              symbolic link to x86_64-linux-gnu-strip
...

dpkg -L binutils-x86-64-linux-gnu

/usr/bin/x86_64-linux-gnu-gold
/usr/bin/x86_64-linux-gnu-ld
...
/usr/bin/x86_64-linux-gnu-objcopy
 diverted by binutils-multiarch to: /usr/bin/x86_64-linux-gnu-objcopy.single
/usr/bin/x86_64-linux-gnu-strip
 diverted by binutils-multiarch to: /usr/bin/x86_64-linux-gnu-strip.single
/usr/share/man/man1/x86_64-linux*
...

dpkg -L libbinutils

/usr/lib/x86_64-linux-gnu/libbfd-2.30-system.so
/usr/lib/x86_64-linux-gnu/libopcodes-2.30-system.so

@marc-hb
Copy link
Collaborator

marc-hb commented May 3, 2019

Adding -static to the gcc linker invocation avoids the spurious sections and produces the same sections than on Fedora (which doesn't need -static). Works with both gcc-7 and gcc-8:

gcc-8 -static -m32 -ffreestanding -fno-pic -fno-asynchronous-unwind-tables -mno-sse -mno-red-zone -Wl,--build-id=none -nostdlib -nodefaultlibs -nostartfiles -T ../../arch/x86_64/core/xuk-stub32.ld ./q64/zephyr/arch/arch/x86_64/core/xuk-stub32.o -o ./q64/zephyr/arch/arch/x86_64/core/xuk-stub32.elf

objdump -h ./q64/zephyr/arch/arch/x86_64/core/xuk-stub32.elf

./q64/zephyr/arch/arch/x86_64/core/xuk-stub32.elf:     file format elf32-i386

Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .stub32       0000026c  00100000  00100000  00001000  2**2
                  CONTENTS, ALLOC, LOAD, CODE
  1 .comment      00000029  00000000  00000000  0000126c  2**0
                  CONTENTS, READONLY

@andyross
Copy link
Contributor

andyross commented May 3, 2019

Hilarious. So... that sorta fits within the LTO theory. Normally "-static" is just about library selection, and of course there are no libraries involved in this link (we don't even pull anything from libgcc). But... yeah, maybe it's become a flag to the linker to disable some LTO generation that otherwise is assumed to be needed?

If nothing else, this is a cleaner workaround than mine. You want to roll the patch or should I do it?

@marc-hb
Copy link
Collaborator

marc-hb commented May 3, 2019

Please do, considering my limited understanding I'd much rather you keep the git blame on all this. I just gave this hack a very quick sanitycheck and that passed. It may also burn your house down (but it didn't seem to break Fedora)

--- a/arch/x86_64/core/CMakeLists.txt
+++ b/arch/x86_64/core/CMakeLists.txt
@@ -45,6 +45,7 @@ add_custom_command(
     -o ${CMAKE_CURRENT_BINARY_DIR}/xuk-stub32.o
   COMMAND ${CMAKE_C_COMPILER} -m32 ${X86_64_BASE_CFLAGS}
     -Wl,--build-id=none -nostdlib -nodefaultlibs -nostartfiles
+    -static
     -T ${CMAKE_CURRENT_SOURCE_DIR}/xuk-stub32.ld
     ${CMAKE_CURRENT_BINARY_DIR}/xuk-stub32.o
     -o ${CMAKE_CURRENT_BINARY_DIR}/xuk-stub32.elf

But... yeah, maybe it's become a flag to the linker to disable some LTO generation that otherwise is assumed to be needed?

Any more advanced theory as to why Fedora doesn't need it? LTO off by default or something?

@marc-hb
Copy link
Collaborator

marc-hb commented May 4, 2019

I was focusing on the linker and didn't see these major gcc upgrades at the same time:

Start-Date: 2019-05-02  14:16:48
Commandline: apt-get upgrade

gcc-7:amd64      (7.3.0-27ubuntu1~18.04, 7.4.0-1ubuntu1~18.04)
gcc-7-base:amd64 (7.3.0-27ubuntu1~18.04, 7.4.0-1ubuntu1~18.04)

gcc-8-base:amd64 (8.2.0-1ubuntu2~18.04, 8.3.0-6ubuntu1~18.04)
etc.

@marc-hb
Copy link
Collaborator

marc-hb commented May 4, 2019

After stracing ld invocations and comparing various gcc versions I found the main difference between Ubuntu and Fedora in this particular context:

gcc -v
    ...      --enable-default-pie   ...

gcc -no-pie also works: it produces the exact same and working xuk-stub32.elf binary as gcc -static. So maybe -no-pie is preferable here as it causes less change in the linker invocation? More "future-proof"?

Now this still doesn't seem to explain the very recent regression because Ubuntu says they've been turning on --enable-default-pie for a couple years.

PS: no LTO options difference observed between Ubuntu and Fedora.

@andyross
Copy link
Contributor

andyross commented May 6, 2019

Ah... that's the magic. I love -no-pie because it's clearly correct per the docs. We are not, in fact, generating a position independent executable. So no need to call out a hack or workaround. And nice work digging out the underlying ld command line -- the only real platform difference is that Ubuntu started (mid-LTS, sigh...) to link PIE by default, I guess?

But this remains a linker bug, IMHO. I mean, if the linker wants to emit a few PLT relocation entries for the entry point or whatever, that's fine. They go in their own section. The behavior that killed us wasn't that it was emitting useless junk, but that it was including it in the link. We had our own linker script! It's supposed to include what we ask for! And it's not like it's special behavior for the "default" segment or something, it's a custom-defined segment of ours named "stub32". Likewise it's not like it's special to the default ELF output, we're defining our own PHDR! Ugh. But at least it's understood now. Thanks.

andyross pushed a commit to andyross/zephyr that referenced this issue May 6, 2019
Within the past few days, an update to the Ubuntu 18.04 toolchain has
begun emitting code sections during link that are messing with our
stub generation.  They are appearing in the 32 bit stub link despite
not being defined in the single object file, and (worse) being
included in the output segment (i.e. at the start of the bootloader
entry point!) despite not being specifically included by the linker
script.  I don't understand this behavior at all, and it appears to be
directly contrary to the way the linker is documented.

Marc Herbert discovered this was down to gcc being called with
--enable-default-pie, so -no-pie works to suppress this behavior and
restore the default.  And it's correct: we aren't actually generating
a position independent executable, even if we don't understand why the
linker script is being disregarded (to include sections we don't
include).  See discussion in the linked github issue.

Fixes zephyrproject-rtos#15877

Signed-off-by: Andy Ross <[email protected]>
nashif pushed a commit that referenced this issue May 6, 2019
Within the past few days, an update to the Ubuntu 18.04 toolchain has
begun emitting code sections during link that are messing with our
stub generation.  They are appearing in the 32 bit stub link despite
not being defined in the single object file, and (worse) being
included in the output segment (i.e. at the start of the bootloader
entry point!) despite not being specifically included by the linker
script.  I don't understand this behavior at all, and it appears to be
directly contrary to the way the linker is documented.

Marc Herbert discovered this was down to gcc being called with
--enable-default-pie, so -no-pie works to suppress this behavior and
restore the default.  And it's correct: we aren't actually generating
a position independent executable, even if we don't understand why the
linker script is being disregarded (to include sections we don't
include).  See discussion in the linked github issue.

Fixes #15877

Signed-off-by: Andy Ross <[email protected]>
@marc-hb
Copy link
Collaborator

marc-hb commented May 7, 2019

I downgraded all the packages below and went back in Zephyr history (to commit 154c20c) yet the issue (and the fix) reproduce exactly the same, no difference whatsoever. So while everyone is happy with the fix, no one knows why and how the regression happened.

 binutils 2.30-21ubuntu1~18.04  GNU assembler, linker and binary utilities
 binutils-common:amd64 2.30-21ubuntu1~18.04  Common files for the GNU assembler, linker and binary utilities
 binutils-multiarch 2.30-21ubuntu1~18.04  Binary utilities that support multi-arch targets
 binutils-x86-64-linux-gnu 2.30-21ubuntu1~18.04  GNU binary utilities, for x86-64-linux-gnu target
 libbinutils:amd64 2.30-21ubuntu1~18.04  GNU binary utilities (private shared library)

 cpp-7 7.3.0-27ubuntu1~18.04  GNU C preprocessor
 g++-7 7.3.0-27ubuntu1~18.04  GNU C++ compiler
 gcc-7 7.3.0-27ubuntu1~18.04  GNU C compiler
 gcc-7-base:amd64 7.3.0-27ubuntu1~18.04  GCC, the GNU Compiler Collection (base package)
 gcc-7-multilib 7.3.0-27ubuntu1~18.04  GNU C compiler (multilib support)
 libgcc-7-dev:amd64 7.3.0-27ubuntu1~18.04  GCC support library (development files)
 lib32gcc-7-dev 7.3.0-27ubuntu1~18.04  GCC support library (32 bit development files)
 libx32gcc-7-dev 7.3.0-27ubuntu1~18.04  GCC support library (x32 development files)
 libstdc++-7-dev:amd64 7.3.0-27ubuntu1~18.04  GNU Standard C++ Library v3 (development files)

 gcc-8-base:amd64 8.2.0-1ubuntu2~18.04  GCC, the GNU Compiler Collection (base package)
 libgcc1:amd64 1:8.2.0-1ubuntu2~18.04  GCC support library
 libstdc++6:amd64 8.2.0-1ubuntu2~18.04  GNU Standard C++ Library v3
 libmpx2:amd64 8.2.0-1ubuntu2~18.04  Intel memory protection extensions (runtime)
 lib32gcc1 1:8.2.0-1ubuntu2~18.04  GCC support library (32 bit Version)
 lib32stdc++6 8.2.0-1ubuntu2~18.04  GNU Standard C++ Library v3 (32 bit Version)
 lib32mpx2 8.2.0-1ubuntu2~18.04  Intel memory protection extensions (32bit)
 libx32gcc1 1:8.2.0-1ubuntu2~18.04  GCC support library (x32)
 libx32stdc++6 8.2.0-1ubuntu2~18.04  GNU Standard C++ Library v3 (x32)

@andyross
Copy link
Contributor

andyross commented May 7, 2019

Heh, your persistence is amazing. I mean, personally I'm fine with a level of understanding that stops at "distro toolchains are voodoo" and with a final fix that amounts to "put our own toolchain into the SDK where we can control the configuration".

But... I dunno. Maybe it's possible this has never worked on 18.04 and and we just plain never noticed? I mean... I'm absolutely sure this was running on gale over thousands of runs a few weeks back. But... I know I was doing some mix of native builds and ones in a Docker built to emulate the CI environment (so in this case using the 16.04 host compiler), and I know there were a few other bugs fixed in the Ubuntu toolchains. So maybe after fixing the final 16.04 bug I never went back and ran it in the host natively? It's... possible.

I mean, if I had to put money down, this feels like a regression. You, Andrew and Flavio all reported the same failure within a day of each other. But... I don't know that I'd give that scenario better than 5:1 odds.

@marc-hb
Copy link
Collaborator

marc-hb commented May 8, 2019

Heh, your persistence is amazing.

"was" :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug The issue is a bug, or the PR is fixing a bug priority: high High impact/importance bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants