all qemu_x86_64 tests hang on Ubuntu 18.04 #15877

andrewboie · 2019-05-03T16:19:54Z

The last message, in all cases, is

Booting from ROM..

My environment:

Linux yggdrasil 4.15.0-48-generic #51-Ubuntu SMP Wed Apr 3 08:28:49 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
gcc (Ubuntu 7.4.0-1ubuntu1~18.04) 7.4.0

The text was updated successfully, but these errors were encountered:

andyross · 2019-05-03T16:24:20Z

@marc-hb and @ceolin have stumbled onto the same thing. Seems to be due to changes in the host toolchain over the last few days. Currently updated F28, F29 and Clear all seem to be producing valid binaries, which I can then copy over to my 18.04 environment and run successfully with the qemu (host and SDK) there.

andyross · 2019-05-03T16:36:31Z

Looks like a linkage issue. Here's the 32 bit boot stub (see xuk-stub32.c) at the beginning of a working binary (this is from a xuk.elf unit test, not a Zephyr binary per se):

Disassembly of section .text:

00100000 <__bss_end-0x1d2c>:
  100000:       bc 00 50 00 00          mov    $0x5000,%esp
  100005:       31 d2                   xor    %edx,%edx
  100007:       3d f7 8d f0 aa          cmp    $0xaaf08df7,%eax
  10000c:       0f 44 a2 00 40 00 00    cmove  0x4000(%edx),%esp
  100013:       53                      push   %ebx
  100014:       50                      push   %eax
  100015:       e8 bc 03 00 00          call   1003d6 <__bss_end-0x1956>

Basically it sets a stack pointer to one of two hard-configured stacks depending on whether this is a initial (multiboot) startup or an auxilliary SMP CPU, and then enters cstart().

Here's the same entry code for the failing binaries:

00100000 <.text>:
  100000:       2f                      das    
  100001:       6c                      insb   (%dx),%es:(%edi)
  100002:       69 62 2f 6c 64 2d 6c    imul   $0x6c2d646c,0x2f(%edx),%esp
  100009:       69 6e 75 78 2e 73 6f    imul   $0x6f732e78,0x75(%esi),%ebp
  100010:       2e 32 00                xor    %cs:(%eax),%al
        ...
  100027:       00 01                   add    %al,(%ecx)
  100029:       00 00                   add    %al,(%eax)
  10002b:       00 01                   add    %al,(%ecx)
  10002d:       00 00                   add    %al,(%eax)
  10002f:       00 01                   add    %al,(%ecx)
        ...
  10003d:       00 00                   add    %al,(%eax)
  10003f:       00 bc 00 50 00 00 31    add    %bh,0x31000050(%eax,%eax,1)
  100046:       d2 3d f7 8d f0 aa       sarb   %cl,0xaaf08df7

There's 64 bytes of garbage prepended! Note that the "bc 00 50 00 00" sequence of the initial MOV is present at offset 0x100040 (though it's obviously being mis-disassembled here).

No idea where that's coming from yet. The linker script quite clearly puts a ".xuk_stub32" segment first, and the object file looks correct...

marc-hb · 2019-05-03T16:40:59Z

BTW both gcc 7 and 8 are available on this distro:

apt-get install gcc-8

<some once-off update-alternatives magic, Google for details>

update-alternatives --config gcc

There are 2 choices for the alternative gcc (providing /usr/bin/gcc).

  Selection    Path            Priority   Status
------------------------------------------------------------
  0            /usr/bin/gcc-8   800       auto mode
  1            /usr/bin/gcc-7   700       manual mode
* 2            /usr/bin/gcc-8   800       manual mode

Not sure about the linker implications though.

andyross · 2019-05-03T16:44:06Z

Heh, contents of the garbage:

00001000  2f 6c 69 62 2f 6c 64 2d  6c 69 6e 75 78 2e 73 6f  |/lib/ld-linux.so|
00001010  2e 32 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |.2..............|
00001020  00 00 00 00 00 00 00 00  01 00 00 00 01 00 00 00  |................|
00001030  01 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

Absolutely no idea what this is, but that string tells me 100% it was inserted by the toolchain and not Zephyr's build code. :)

andyross · 2019-05-03T18:20:37Z

Yeah, something is wonky on Ubuntu. The toolchain (binutils, not gcc per se) is emitting stuff into the linked binaries that Just Shouldn't Be There.

The garbage bytes above are reported in the map file as the contents of the .interp, .gnu.version_d, .gnu.version, .gnu_version_r, .dynsym, .dynstr, .gnu_hash and .eh_frame sections. But the linker script does not reference these seconds, and in fact the only linked object file (xuk-stub32.o) does not even contain these sections.

I can write some hacky lines in the linker script to bin these into a garbage segment, and with that the code gets through its 32 bit bootstrapping. It fails on entry into 64 bit mode, I'm guessing, for exactly the same reason.

That will probably do for now if I can finish up the workaround. But... I'm at a complete loss here. I don't know where these things are coming from, at all. At a guess... maybe it's an interaction with how Ubuntu has configured LTO in teh compiler (that is, the linker needs to generate code, so it needs to be prepared to emit stuff like this, and isn't correctly turning it off or something). And yeah, I checked: -fno-lto does nothing.

marc-hb · 2019-05-03T19:16:08Z

@andyross have you tried downgrading binutils? In /var/log/apt/history.log on your system:

Start-Date: 2019-05-02  14:16:48
Commandline: apt-get upgrade
Upgrade: binutils:amd64 (2.30-21ubuntu1~18.04, 2.30-21ubuntu1~18.04.1), 
 binutils-x86-64-linux-gnu:amd64 (2.30-21ubuntu1~18.04, 2.30-21ubuntu1~18.04.1)
    etc.

marc-hb · 2019-05-03T19:41:19Z

Should be especially easy to downgrade considering Ubuntu websites' still point at the previous version:
https://packages.ubuntu.com/bionic/binutils

marc-hb · 2019-05-03T20:55:44Z

considering Ubuntu websites' still point at the previous version: https://packages.ubuntu.com/bionic/binutils

Wrong site, correction: https://packages.ubuntu.com/bionic-updates/binutils
So now we have a changelog: http://changelogs.ubuntu.com/changelogs/pool/main/b/binutils/binutils_2.30-21ubuntu1~18.04.1/changelog
... which looks like a no-op. The mystery deepens.

marc-hb · 2019-05-03T22:50:16Z

I found the (very slightly) older packages there:
https://launchpad.net/ubuntu/bionic/amd64/binutils-x86-64-linux-gnu/2.30-21ubuntu1~18.04/
I downgraded them all:

dpkg -l *binutils*

ii  binutils                              2.30-21ubuntu1~18.04                        amd64        GNU assembler, linker and binary utilities
ii  binutils-common:amd64                 2.30-21ubuntu1~18.04                        amd64        Common files for the GNU assembler, linker and binary 
ii  binutils-multiarch                    2.30-21ubuntu1~18.04                        amd64        Binary utilities that support multi-arch targets
ii  binutils-x86-64-linux-gnu             2.30-21ubuntu1~18.04                        amd64        GNU binary utilities, for x86-64-linux-gnu target
ii  libbinutils:amd64                     2.30-21ubuntu1~18.04                        amd64        GNU binary utilities (private shared library)

Yet I still get this:

objdump -h ./q64/zephyr/arch/arch/x86_64/core/xuk-stub32.elf

./q64/zephyr/arch/arch/x86_64/core/xuk-stub32.elf:     file format elf32-i386

Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .interp       00000013  00100000  00100000  00001000  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  1 .dynsym       00000010  00100014  00100014  00001014  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  2 .dynstr       00000001  00100024  00100024  00001024  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  3 .gnu.hash     00000018  00100028  00100028  00001028  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  4 .stub32       0000026c  00100040  00100040  00001040  2**2
                  CONTENTS, ALLOC, LOAD, CODE
  5 .rel.dyn      00000008  001002ac  001002ac  000012ac  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  6 .dynamic      00000088  001002c0  001002c0  000012c0  2**2
                  CONTENTS, ALLOC, LOAD, DATA
  7 .comment      00000029  00000000  00000000  00001348  2**0
                  CONTENTS, READONLY

Ubuntu binutils decoder ring:

file $(dpkg -L binutils)

/usr/bin/ld:                                 symbolic link to x86_64-linux-gnu-ld
/usr/bin/ld.bfd:                             symbolic link to x86_64-linux-gnu-ld.bfd
/usr/bin/ld.gold:                            symbolic link to x86_64-linux-gnu-ld.gold
/usr/bin/objcopy:                            symbolic link to x86_64-linux-gnu-objcopy
/usr/bin/strip:                              symbolic link to x86_64-linux-gnu-strip
...

dpkg -L binutils-x86-64-linux-gnu

/usr/bin/x86_64-linux-gnu-gold
/usr/bin/x86_64-linux-gnu-ld
...
/usr/bin/x86_64-linux-gnu-objcopy
 diverted by binutils-multiarch to: /usr/bin/x86_64-linux-gnu-objcopy.single
/usr/bin/x86_64-linux-gnu-strip
 diverted by binutils-multiarch to: /usr/bin/x86_64-linux-gnu-strip.single
/usr/share/man/man1/x86_64-linux*
...

dpkg -L libbinutils

/usr/lib/x86_64-linux-gnu/libbfd-2.30-system.so
/usr/lib/x86_64-linux-gnu/libopcodes-2.30-system.so

marc-hb · 2019-05-03T23:38:31Z

Adding -static to the gcc linker invocation avoids the spurious sections and produces the same sections than on Fedora (which doesn't need -static). Works with both gcc-7 and gcc-8:

gcc-8 -static -m32 -ffreestanding -fno-pic -fno-asynchronous-unwind-tables -mno-sse -mno-red-zone -Wl,--build-id=none -nostdlib -nodefaultlibs -nostartfiles -T ../../arch/x86_64/core/xuk-stub32.ld ./q64/zephyr/arch/arch/x86_64/core/xuk-stub32.o -o ./q64/zephyr/arch/arch/x86_64/core/xuk-stub32.elf

objdump -h ./q64/zephyr/arch/arch/x86_64/core/xuk-stub32.elf

./q64/zephyr/arch/arch/x86_64/core/xuk-stub32.elf:     file format elf32-i386

Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .stub32       0000026c  00100000  00100000  00001000  2**2
                  CONTENTS, ALLOC, LOAD, CODE
  1 .comment      00000029  00000000  00000000  0000126c  2**0
                  CONTENTS, READONLY

andyross · 2019-05-03T23:42:44Z

Hilarious. So... that sorta fits within the LTO theory. Normally "-static" is just about library selection, and of course there are no libraries involved in this link (we don't even pull anything from libgcc). But... yeah, maybe it's become a flag to the linker to disable some LTO generation that otherwise is assumed to be needed?

If nothing else, this is a cleaner workaround than mine. You want to roll the patch or should I do it?

marc-hb · 2019-05-03T23:57:57Z

Please do, considering my limited understanding I'd much rather you keep the git blame on all this. I just gave this hack a very quick sanitycheck and that passed. It may also burn your house down (but it didn't seem to break Fedora)

--- a/arch/x86_64/core/CMakeLists.txt
+++ b/arch/x86_64/core/CMakeLists.txt
@@ -45,6 +45,7 @@ add_custom_command(
     -o ${CMAKE_CURRENT_BINARY_DIR}/xuk-stub32.o
   COMMAND ${CMAKE_C_COMPILER} -m32 ${X86_64_BASE_CFLAGS}
     -Wl,--build-id=none -nostdlib -nodefaultlibs -nostartfiles
+    -static
     -T ${CMAKE_CURRENT_SOURCE_DIR}/xuk-stub32.ld
     ${CMAKE_CURRENT_BINARY_DIR}/xuk-stub32.o
     -o ${CMAKE_CURRENT_BINARY_DIR}/xuk-stub32.elf

But... yeah, maybe it's become a flag to the linker to disable some LTO generation that otherwise is assumed to be needed?

Any more advanced theory as to why Fedora doesn't need it? LTO off by default or something?

marc-hb · 2019-05-04T07:32:47Z

I was focusing on the linker and didn't see these major gcc upgrades at the same time:

Start-Date: 2019-05-02  14:16:48
Commandline: apt-get upgrade

gcc-7:amd64      (7.3.0-27ubuntu1~18.04, 7.4.0-1ubuntu1~18.04)
gcc-7-base:amd64 (7.3.0-27ubuntu1~18.04, 7.4.0-1ubuntu1~18.04)

gcc-8-base:amd64 (8.2.0-1ubuntu2~18.04, 8.3.0-6ubuntu1~18.04)
etc.

marc-hb · 2019-05-04T07:56:12Z

After stracing ld invocations and comparing various gcc versions I found the main difference between Ubuntu and Fedora in this particular context:

gcc -v
    ...      --enable-default-pie   ...

gcc -no-pie also works: it produces the exact same and working xuk-stub32.elf binary as gcc -static. So maybe -no-pie is preferable here as it causes less change in the linker invocation? More "future-proof"?

Now this still doesn't seem to explain the very recent regression because Ubuntu says they've been turning on --enable-default-pie for a couple years.

PS: no LTO options difference observed between Ubuntu and Fedora.

andyross · 2019-05-06T16:54:42Z

Ah... that's the magic. I love -no-pie because it's clearly correct per the docs. We are not, in fact, generating a position independent executable. So no need to call out a hack or workaround. And nice work digging out the underlying ld command line -- the only real platform difference is that Ubuntu started (mid-LTS, sigh...) to link PIE by default, I guess?

But this remains a linker bug, IMHO. I mean, if the linker wants to emit a few PLT relocation entries for the entry point or whatever, that's fine. They go in their own section. The behavior that killed us wasn't that it was emitting useless junk, but that it was including it in the link. We had our own linker script! It's supposed to include what we ask for! And it's not like it's special behavior for the "default" segment or something, it's a custom-defined segment of ours named "stub32". Likewise it's not like it's special to the default ELF output, we're defining our own PHDR! Ugh. But at least it's understood now. Thanks.

Within the past few days, an update to the Ubuntu 18.04 toolchain has begun emitting code sections during link that are messing with our stub generation. They are appearing in the 32 bit stub link despite not being defined in the single object file, and (worse) being included in the output segment (i.e. at the start of the bootloader entry point!) despite not being specifically included by the linker script. I don't understand this behavior at all, and it appears to be directly contrary to the way the linker is documented. Marc Herbert discovered this was down to gcc being called with --enable-default-pie, so -no-pie works to suppress this behavior and restore the default. And it's correct: we aren't actually generating a position independent executable, even if we don't understand why the linker script is being disregarded (to include sections we don't include). See discussion in the linked github issue. Fixes zephyrproject-rtos#15877 Signed-off-by: Andy Ross <[email protected]>

Within the past few days, an update to the Ubuntu 18.04 toolchain has begun emitting code sections during link that are messing with our stub generation. They are appearing in the 32 bit stub link despite not being defined in the single object file, and (worse) being included in the output segment (i.e. at the start of the bootloader entry point!) despite not being specifically included by the linker script. I don't understand this behavior at all, and it appears to be directly contrary to the way the linker is documented. Marc Herbert discovered this was down to gcc being called with --enable-default-pie, so -no-pie works to suppress this behavior and restore the default. And it's correct: we aren't actually generating a position independent executable, even if we don't understand why the linker script is being disregarded (to include sections we don't include). See discussion in the linked github issue. Fixes #15877 Signed-off-by: Andy Ross <[email protected]>

marc-hb · 2019-05-07T20:27:39Z

I downgraded all the packages below and went back in Zephyr history (to commit 154c20c) yet the issue (and the fix) reproduce exactly the same, no difference whatsoever. So while everyone is happy with the fix, no one knows why and how the regression happened.

 binutils 2.30-21ubuntu1~18.04  GNU assembler, linker and binary utilities
 binutils-common:amd64 2.30-21ubuntu1~18.04  Common files for the GNU assembler, linker and binary utilities
 binutils-multiarch 2.30-21ubuntu1~18.04  Binary utilities that support multi-arch targets
 binutils-x86-64-linux-gnu 2.30-21ubuntu1~18.04  GNU binary utilities, for x86-64-linux-gnu target
 libbinutils:amd64 2.30-21ubuntu1~18.04  GNU binary utilities (private shared library)

 cpp-7 7.3.0-27ubuntu1~18.04  GNU C preprocessor
 g++-7 7.3.0-27ubuntu1~18.04  GNU C++ compiler
 gcc-7 7.3.0-27ubuntu1~18.04  GNU C compiler
 gcc-7-base:amd64 7.3.0-27ubuntu1~18.04  GCC, the GNU Compiler Collection (base package)
 gcc-7-multilib 7.3.0-27ubuntu1~18.04  GNU C compiler (multilib support)
 libgcc-7-dev:amd64 7.3.0-27ubuntu1~18.04  GCC support library (development files)
 lib32gcc-7-dev 7.3.0-27ubuntu1~18.04  GCC support library (32 bit development files)
 libx32gcc-7-dev 7.3.0-27ubuntu1~18.04  GCC support library (x32 development files)
 libstdc++-7-dev:amd64 7.3.0-27ubuntu1~18.04  GNU Standard C++ Library v3 (development files)

 gcc-8-base:amd64 8.2.0-1ubuntu2~18.04  GCC, the GNU Compiler Collection (base package)
 libgcc1:amd64 1:8.2.0-1ubuntu2~18.04  GCC support library
 libstdc++6:amd64 8.2.0-1ubuntu2~18.04  GNU Standard C++ Library v3
 libmpx2:amd64 8.2.0-1ubuntu2~18.04  Intel memory protection extensions (runtime)
 lib32gcc1 1:8.2.0-1ubuntu2~18.04  GCC support library (32 bit Version)
 lib32stdc++6 8.2.0-1ubuntu2~18.04  GNU Standard C++ Library v3 (32 bit Version)
 lib32mpx2 8.2.0-1ubuntu2~18.04  Intel memory protection extensions (32bit)
 libx32gcc1 1:8.2.0-1ubuntu2~18.04  GCC support library (x32)
 libx32stdc++6 8.2.0-1ubuntu2~18.04  GNU Standard C++ Library v3 (x32)

andyross · 2019-05-07T20:39:05Z

Heh, your persistence is amazing. I mean, personally I'm fine with a level of understanding that stops at "distro toolchains are voodoo" and with a final fix that amounts to "put our own toolchain into the SDK where we can control the configuration".

But... I dunno. Maybe it's possible this has never worked on 18.04 and and we just plain never noticed? I mean... I'm absolutely sure this was running on gale over thousands of runs a few weeks back. But... I know I was doing some mix of native builds and ones in a Docker built to emulate the CI environment (so in this case using the 16.04 host compiler), and I know there were a few other bugs fixed in the Ubuntu toolchains. So maybe after fixing the final 16.04 bug I never went back and ran it in the host natively? It's... possible.

I mean, if I had to put money down, this feels like a regression. You, Andrew and Flavio all reported the same failure within a day of each other. But... I don't know that I'd give that scenario better than 5:1 odds.

marc-hb · 2019-05-08T20:09:53Z

Heh, your persistence is amazing.

"was" :-)

andrewboie added bug The issue is a bug, or the PR is fixing a bug priority: high High impact/importance bug labels May 3, 2019

andrewboie assigned andyross May 3, 2019

andyross mentioned this issue May 3, 2019

arch/x86_64: Junk spurious linker-generated code in the 32 bit stub #15891

Merged

nashif closed this as completed in #15891 May 6, 2019

marc-h38 mentioned this issue Jan 8, 2020

Better support for bare metal and low level OS targets mesonbuild/meson#6434

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

all qemu_x86_64 tests hang on Ubuntu 18.04 #15877

all qemu_x86_64 tests hang on Ubuntu 18.04 #15877

andrewboie commented May 3, 2019

andyross commented May 3, 2019

andyross commented May 3, 2019

marc-hb commented May 3, 2019 •

edited

Loading

andyross commented May 3, 2019

andyross commented May 3, 2019

marc-hb commented May 3, 2019

marc-hb commented May 3, 2019

marc-hb commented May 3, 2019

marc-hb commented May 3, 2019 •

edited

Loading

marc-hb commented May 3, 2019

andyross commented May 3, 2019

marc-hb commented May 3, 2019

marc-hb commented May 4, 2019

marc-hb commented May 4, 2019 •

edited

Loading

andyross commented May 6, 2019

marc-hb commented May 7, 2019

andyross commented May 7, 2019

marc-hb commented May 8, 2019

all qemu_x86_64 tests hang on Ubuntu 18.04 #15877

all qemu_x86_64 tests hang on Ubuntu 18.04 #15877

Comments

andrewboie commented May 3, 2019

andyross commented May 3, 2019

andyross commented May 3, 2019

marc-hb commented May 3, 2019 • edited Loading

andyross commented May 3, 2019

andyross commented May 3, 2019

marc-hb commented May 3, 2019

marc-hb commented May 3, 2019

marc-hb commented May 3, 2019

marc-hb commented May 3, 2019 • edited Loading

marc-hb commented May 3, 2019

andyross commented May 3, 2019

marc-hb commented May 3, 2019

marc-hb commented May 4, 2019

marc-hb commented May 4, 2019 • edited Loading

andyross commented May 6, 2019

marc-hb commented May 7, 2019

andyross commented May 7, 2019

marc-hb commented May 8, 2019

marc-hb commented May 3, 2019 •

edited

Loading

marc-hb commented May 3, 2019 •

edited

Loading

marc-hb commented May 4, 2019 •

edited

Loading