kernel fails to boot (MBR) when built with gcc 10+ or upgraded to 6.7+ #83

stapelberg · 2024-01-10T19:23:06Z

Update: I published a blog post about this issue: https://michael.stapelberg.ch/posts/2024-02-11-minimal-linux-bootloader-debugging-story/

rtr7/kernel#434 fails to boot in qemu and on the PC Engines apu2c4. Notably, the kernel doesn’t even seem to start — no “Decompressing linux” message is printed, and SeaBIOS just tries to boot over and over again.

There are multiple triggering conditions, it seems.

Even our current kernel version (6.6.10) fails to boot when built with Debian bullseye instead of Debian buster:

--- i/cmd/rtr7-rebuild-kernel/kernel.go
+++ w/cmd/rtr7-rebuild-kernel/kernel.go
@@ -28,7 +28,7 @@ import (
 )
 
 const dockerFileContents = `
-FROM debian:buster
+FROM debian:bullseye
 
 RUN apt-get update && apt-get install -y crossbuild-essential-arm64 bc libssl-dev bison flex libelf-dev ncurses-dev

Looking at the versions:

Debian buster contains gcc-8 (8.3.0-6) and binutils 2.31.1-16.
Debian bullseye contains gcc-10 (10.2.1-6) and binutils 2.35.2-2.

I also tried Debian buster (gcc-8), but with binutils 2.35.2-2 from bullseye, and that still works.

I then tried Debian buster, but with gcc 10 and binutils 2.35.2-2, and the resulting kernel no longer boots.

I suspect the problem is with the minimal MBR bootloader we use (https://github.com/gokrazy/internal/blob/main/mbr/bootloader.asm): when I tell qemu to boot the Linux kernel directly (without going through SeaBIOS), the kernel boots up correctly.

I verified that the printed vmlinuz and cmdline.txt LBAs point to the correct location. I also verified that a working kernel, padded to the size of the non-working kernel, still works correctly, so it seems like the size of the file is not an issue.

stapelberg · 2024-01-10T19:52:25Z

I also reproduced the problem within Nix: gcc8 (8.5.0) works:

% cat /tmp/oldgcc.nix
with import <nixpkgs> {};
gcc8Stdenv.mkDerivation {
  name = "foo";
  buildInputs = [
    bc
    gcc
    flex
    bison
    openssl
    elfutils
    libelf
    ncurses
  ];
}
% nix-shell /tmp/oldgcc.nix
[…]

And with gcc10 (10.4.0), the kernel fails to boot.

With gcc9 (9.5.0), the kernel boots correctly.

So it seems like the problem is triggered by gcc10+.

stapelberg · 2024-01-10T20:05:53Z

I read that one of the main changes in gcc 10 is to enable stack protection by default.

Indeed, building the kernel on debian:bullseye, but with CONFIG_STACKPROTECTOR=n makes it boot.
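To double-check which stack protector options a given build actually used, the kernel .config can be inspected. The following is a minimal sketch, assuming the usual kconfig output format where a disabled option is written as "# CONFIG_FOO is not set". The function name and the two sample excerpts are hypothetical and not taken from the actual builds.

```python
import re


def stackprotector_settings(config_text: str) -> dict[str, str]:
    """Return the value of every CONFIG_STACKPROTECTOR* option in a .config.

    Options that kconfig disabled ("# CONFIG_FOO is not set") are
    reported with the value "n".
    """
    settings = {}
    for line in config_text.splitlines():
        line = line.strip()
        enabled = re.fullmatch(r"(CONFIG_STACKPROTECTOR\w*)=(.*)", line)
        disabled = re.fullmatch(r"# (CONFIG_STACKPROTECTOR\w*) is not set", line)
        if enabled:
            settings[enabled.group(1)] = enabled.group(2)
        elif disabled:
            settings[disabled.group(1)] = "n"
    return settings


# Hypothetical .config excerpts for illustration only.
with_protector = "CONFIG_STACKPROTECTOR=y\nCONFIG_STACKPROTECTOR_STRONG=y\n"
without_protector = "# CONFIG_STACKPROTECTOR is not set\n"

print(stackprotector_settings(with_protector))
print(stackprotector_settings(without_protector))
```

Running this against the .config of a booting and a non-booting build would show whether the stack protector setting is really the only difference between them.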

So I suspect that our bootloader does not set up the stack correctly.

I don’t know what the connection to Linux 6.7+ is yet, though.

stapelberg · 2024-01-13T07:17:06Z

I was wondering why I couldn’t get SeaBIOS debug output to show up in qemu. Turns out that when I don’t use Arch’s qemu 8.1.2, but the qemu 7.2.0 I’m using on router7, I do get SeaBIOS debug output on stdout 🤦 Maybe a bug in newer versions, or the configuration changed.

I attached the working and broken SeaBIOS debug output: qemu-boot.broken.txt, qemu-boot.working.txt

The diff is:

% diff -u /tmp/qemu-boot.working.txt /tmp/qemu-boot.broken.txt
--- /tmp/qemu-boot.working.txt	2024-01-13 08:14:14.705313715 +0100
+++ /tmp/qemu-boot.broken.txt	2024-01-13 08:14:22.355448379 +0100
@@ -1,4 +1,4 @@
-/tmp/qemu/bin/qemu-system-x86_64 -boot order=c,reboot-timeout=5000 -drive file=/tmp/gokr-boot1986672184,format=raw -net nic,macaddr=b8:27:eb:12:34:56 -usb -chardev stdio,id=seabios -device isa-debugcon,iobase=0x402,chardev=seabios      
+/tmp/qemu/bin/qemu-system-x86_64 -boot order=c,reboot-timeout=5000 -drive file=/tmp/gokr-boot1338272335,format=raw -net nic,macaddr=b8:27:eb:12:34:56 -usb -chardev stdio,id=seabios -device isa-debugcon,iobase=0x402,chardev=seabios -s -S 
 qemu-system-x86_64: warning: hub 0 is not connected to host network
 VNC server running on ::1:5900
 SeaBIOS (version rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org)
@@ -106,5 +106,115 @@
   NULL
 Booting from Hard Disk...
 Booting from 0000:7c00
-VBE mode info request: 100
+In resume (status=0)
+In 32bit resume
+Attempting a hard reboot

[…]

stapelberg · 2024-01-13T08:08:04Z

Speaking of working with older software versions, here’s how to start a Docker container with Debian stretch, which contains qemu 2.8, a version in which single-stepping through the MBR works out of the box (bug report regarding more recent versions: https://gitlab.com/qemu-project/qemu/-/issues/141):

% docker run --net=host -v /tmp:/tmp -ti debian:stretch

root@650a2157f663:/# cat > /etc/apt/sources.list <<'EOT'
deb http://archive.debian.org/debian/ stretch contrib main non-free
deb http://archive.debian.org/debian-security/ stretch/updates main
EOT

root@650a2157f663:/# apt update
root@650a2157f663:/# apt install qemu-system-x86
root@650a2157f663:/# qemu-system-i386 -nographic -boot order=c,reboot-timeout=5000 -drive file=/tmp/gokr-boot1338272335,format=raw -net nic,macaddr=b8:27:eb:12:34:56 -usb -s -S

Then, on the host:

% gdb
(gdb) set architecture i8086
(gdb) target remote localhost:1234
(gdb) symbol-file bootloader.elf
(gdb) layout split
(gdb) layout src
(gdb) layout regs
(gdb) b *0x7c00
(gdb) c

We can verify the kernel command line is loaded from cmd_lba to 0x1e000:

(gdb) b read_kernel_bootsector
(gdb) x/s 0x1e000

To understand the program flow, I set up breakpoints at each function of the bootloader:

b read_kernel_setup
b check_version
b read_protected_mode_kernel
b read_protected_mode_kernel_2
b run_kernel
b error
b reboot

The list of functions that are run with the working kernel:

(gdb) b read_kernel_setup
Breakpoint 2 at 0x7c38: file bootloader.asm, line 69.
(gdb) b check_version
Breakpoint 3 at 0x7c56: file bootloader.asm, line 82.
(gdb) b read_protected_mode_kernel
Breakpoint 4 at 0x7c8f: file bootloader.asm, line 99.
(gdb) b read_protected_mode_kernel_2
Breakpoint 5 at 0x7cd6: file bootloader.asm, line 120.
(gdb) b run_kernel
Breakpoint 6 at 0x7cff: file bootloader.asm, line 136.
(gdb) b error
Breakpoint 7 at 0x7d51: file bootloader.asm, line 184.
(gdb) b reboot
Breakpoint 8 at 0x7d62: file bootloader.asm, line 198.
(gdb) c
Continuing.

Breakpoint 2, read_kernel_setup () at bootloader.asm:69
69		xor	eax, eax
(gdb) c
Continuing.

Breakpoint 3, check_version () at bootloader.asm:82
82		cmp	word [es:0x206], 0x204		; we need protocol version >= 2.04
(gdb) c
Continuing.

Breakpoint 4, read_protected_mode_kernel () at bootloader.asm:99
99		mov	edx, [es:0x1f4]			; edx stores the number of bytes to load
(gdb) c
Continuing.

Breakpoint 5, read_protected_mode_kernel_2 () at bootloader.asm:120
120		mov	eax, edx
(gdb) c
Continuing.

Breakpoint 6, run_kernel () at bootloader.asm:136
136		cli			; disable interrupts
(gdb)

The list of functions that are run with the broken kernel:

(gdb) b read_kernel_setup
Breakpoint 2 at 0x7c38: file bootloader.asm, line 69.
(gdb) b check_version
Breakpoint 3 at 0x7c56: file bootloader.asm, line 82.
(gdb) b read_protected_mode_kernel
Breakpoint 4 at 0x7c8f: file bootloader.asm, line 99.
(gdb) b read_protected_mode_kernel_2
Breakpoint 5 at 0x7cd6: file bootloader.asm, line 120.
(gdb) b run_kernel
Breakpoint 6 at 0x7cff: file bootloader.asm, line 136.
(gdb) b error
Breakpoint 7 at 0x7d51: file bootloader.asm, line 184.
(gdb) b reboot
Breakpoint 8 at 0x7d62: file bootloader.asm, line 198.
(gdb) c
Continuing.

Breakpoint 2, read_kernel_setup () at bootloader.asm:69
69		xor	eax, eax
(gdb) c
Continuing.

Breakpoint 3, check_version () at bootloader.asm:82
82		cmp	word [es:0x206], 0x204		; we need protocol version >= 2.04
(gdb) c
Continuing.

Breakpoint 4, read_protected_mode_kernel () at bootloader.asm:99
99		mov	edx, [es:0x1f4]			; edx stores the number of bytes to load
(gdb) c
Continuing.

Breakpoint 1, ?? () at bootloader.asm:42
42		cli

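With the broken kernel, execution never reaches read_protected_mode_kernel_2. Instead, breakpoint 1 (the bootloader entry point at 0x7c00) is hit again, which means the machine reset while read_protected_mode_kernel was running. So the problem seems to be with loading the kernel from disk.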
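Comparing the two gdb sessions by eye works for a handful of breakpoints, but the comparison can also be automated. The following is a rough sketch, assuming the transcripts use the "Breakpoint N, function () at file:line" format shown above. The helper names are hypothetical, and the two transcripts are abbreviated copies of the sessions above.

```python
import re

# Matches lines such as:
#   Breakpoint 4, read_protected_mode_kernel () at bootloader.asm:99
BREAKPOINT_HIT = re.compile(r"^Breakpoint (\d+), (\S+) \(\) at (\S+):(\d+)$")


def breakpoint_hits(transcript: str) -> list[tuple[str, int]]:
    """Extract (function, line) for every breakpoint hit in a gdb transcript."""
    hits = []
    for line in transcript.splitlines():
        m = BREAKPOINT_HIT.match(line.strip())
        if m:
            hits.append((m.group(2), int(m.group(4))))
    return hits


def first_divergence(working, broken):
    """Return (index, working_hit, broken_hit) of the first differing hit.

    Returns None if the shorter sequence is a prefix of the longer one.
    """
    for i, (w, b) in enumerate(zip(working, broken)):
        if w != b:
            return i, w, b
    return None


working_log = """
Breakpoint 2, read_kernel_setup () at bootloader.asm:69
Breakpoint 3, check_version () at bootloader.asm:82
Breakpoint 4, read_protected_mode_kernel () at bootloader.asm:99
Breakpoint 5, read_protected_mode_kernel_2 () at bootloader.asm:120
Breakpoint 6, run_kernel () at bootloader.asm:136
"""

broken_log = """
Breakpoint 2, read_kernel_setup () at bootloader.asm:69
Breakpoint 3, check_version () at bootloader.asm:82
Breakpoint 4, read_protected_mode_kernel () at bootloader.asm:99
Breakpoint 1, ?? () at bootloader.asm:42
"""

print(first_divergence(breakpoint_hits(working_log), breakpoint_hits(broken_log)))
```

The first differing entry is the hit that follows read_protected_mode_kernel, which is where the broken kernel ends up back at line 42 of the bootloader.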

stapelberg · 2024-01-13T09:15:56Z

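With the broken kernel, read_protected_mode_kernel behaves as follows:

edx at line 102 is 15840256 (the number of bytes to load).
Set a breakpoint with b 102 (read_protected_mode_kernel.loop) and continue a few times.
Make that breakpoint conditional: cond 10 $edx == 104448
The do_move call made at that point does not return.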

Theory: extended mode is limited to 15 MB, and with the stack protector enabled, our kernel newly exceeds 15 MB of data to copy.
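As a quick sanity check of this theory, the byte count observed in edx can be compared against a 15 MB limit. This is only an arithmetic sketch: whether "15 MB" means 15 MiB (binary) or 15 MB (decimal) is an assumption, so both are computed, and the helper name is hypothetical.

```python
# Value of edx at bootloader.asm line 102 for the broken kernel (from the notes above).
BYTES_TO_LOAD = 15840256

MIB = 1024 * 1024  # binary megabyte
MB = 1000 * 1000   # decimal megabyte


def excess_over_limit(size: int, limit: int) -> int:
    """Number of bytes by which size exceeds limit (negative if it fits)."""
    return size - limit


print(excess_over_limit(BYTES_TO_LOAD, 15 * MIB))  # 111616
print(excess_over_limit(BYTES_TO_LOAD, 15 * MB))   # 840256
```

Under either interpretation, the broken kernel asks the bootloader to copy more than 15 MB.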

I had previously tried padding the kernel to figure out if the size plays a role, but that happened at the wrong level: I padded the vmlinuz file in the FAT file system, but the relevant size is determined by the kernel header data structure, which contains the number of bytes the bootloader will copy.
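The size that matters can be read straight from the kernel image. The following is a minimal sketch based on the Linux x86 boot protocol, in which the setup header stores setup_sects at offset 0x1f1, syssize (in 16-byte units) at 0x1f4, the "HdrS" magic at 0x202 and the protocol version at 0x206. The function name is hypothetical, and the sample header is synthetic: it reuses the 15840256 figure from above purely for illustration, under the assumption that edx holds syssize multiplied by 16.

```python
import struct


def bzimage_sizes(image: bytes) -> dict[str, int]:
    """Read the size-related fields from a bzImage setup header."""
    if image[0x202:0x206] != b"HdrS":
        raise ValueError("not a bzImage: missing HdrS magic at offset 0x202")
    setup_sects = image[0x1F1] or 4  # per the boot protocol, 0 means 4
    (syssize,) = struct.unpack_from("<I", image, 0x1F4)
    (version,) = struct.unpack_from("<H", image, 0x206)
    return {
        "protocol_version": version,
        # Boot sector plus setup sectors, 512 bytes each.
        "real_mode_bytes": (setup_sects + 1) * 512,
        # syssize counts 16-byte paragraphs of protected-mode code.
        "protected_mode_bytes": syssize * 16,
    }


# Synthetic header for illustration only, not taken from a real kernel.
header = bytearray(0x300)
header[0x1F1] = 30
struct.pack_into("<I", header, 0x1F4, 15840256 // 16)
header[0x202:0x206] = b"HdrS"
struct.pack_into("<H", header, 0x206, 0x020F)

print(bzimage_sizes(bytes(header)))
```

With a real image, something like bzimage_sizes(open("vmlinuz", "rb").read()) would report the number of bytes the bootloader has to copy, independent of how large the vmlinuz file is on the FAT file system.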

Here are some resources I found helpful:


stapelberg · 2024-02-11T09:39:43Z

I wrote a blog post about this failure: https://michael.stapelberg.ch/posts/2024-02-11-minimal-linux-bootloader-debugging-story/

I’ll close this issue in favor of a tracking bug in the gokrazy repository about longer-term MBR bootloader changes: gokrazy/gokrazy#248

stapelberg added the bug label Jan 10, 2024

stapelberg mentioned this issue Jan 12, 2024: continuous integration: add kexec test cycle gokrazy/gokrazy#243 (Open)

stapelberg pushed a commit to rtr7/kernel that referenced this issue Jan 13, 2024: auto-update to linux-6.7.tar.xz; enable zstd; upgrade buildenv (65ca3ee), related to rtr7/router7#83

stapelberg closed this as completed Feb 11, 2024

guevara mentioned this issue Feb 14, 2024: Minimal Linux Bootloader debugging story guevara/read-it-later#10887 (Open)
