kernel fails to boot (MBR) when built with gcc 10+ or upgraded to 6.7+ #83

Closed
stapelberg opened this issue Jan 10, 2024 · 6 comments
Labels
bug Something isn't working


@stapelberg
Contributor

stapelberg commented Jan 10, 2024

Update: I published a blog post about this issue: https://michael.stapelberg.ch/posts/2024-02-11-minimal-linux-bootloader-debugging-story/

rtr7/kernel#434 fails to boot in qemu and on the PC Engines apu2c4. Notably, the kernel doesn’t even seem to start — no “Decompressing linux” message is printed, and SeaBIOS just tries to boot over and over again.

There seem to be multiple triggering conditions.

Even our current kernel version (6.6.10) fails to boot when built with Debian bullseye instead of Debian buster:

--- i/cmd/rtr7-rebuild-kernel/kernel.go
+++ w/cmd/rtr7-rebuild-kernel/kernel.go
@@ -28,7 +28,7 @@ import (
 )
 
 const dockerFileContents = `
-FROM debian:buster
+FROM debian:bullseye
 
 RUN apt-get update && apt-get install -y crossbuild-essential-arm64 bc libssl-dev bison flex libelf-dev ncurses-dev
 

Looking at the versions:

  • Debian buster contains gcc-8 (8.3.0-6) and binutils 2.31.1-16.
  • Debian bullseye contains gcc-10 (10.2.1-6) and binutils 2.35.2-2.

I also tried Debian buster (gcc-8), but with binutils 2.35.2-2 from bullseye, and that still works.

I then tried Debian buster, but with gcc 10 and binutils 2.35.2-2, and the resulting kernel no longer boots.

I suspect the problem lies in the minimal MBR bootloader we use (https://github.com/gokrazy/internal/blob/main/mbr/bootloader.asm), because the kernel does boot correctly when qemu loads it directly (without going through SeaBIOS).

I verified that the printed vmlinuz and cmdline.txt LBAs point to the correct location. I also verified that a working kernel, padded to the size of the non-working kernel, still works correctly, so it seems like the size of the file is not an issue.
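
One way to double-check an LBA is to read the sector it points at from the raw disk image and compare it with the start of the file itself. The following is a minimal sketch, not necessarily the procedure used above: `sector_at_lba` and `lba_points_at` are hypothetical helper names, and it assumes 512-byte sectors.

```python
SECTOR_SIZE = 512  # assumption: classic 512-byte sectors


def sector_at_lba(image_path, lba):
    """Return the single sector stored at the given LBA of a raw disk image."""
    with open(image_path, "rb") as img:
        img.seek(lba * SECTOR_SIZE)
        return img.read(SECTOR_SIZE)


def lba_points_at(image_path, lba, file_path):
    """Check that the sector at `lba` matches the first sector of `file_path`."""
    with open(file_path, "rb") as f:
        expected = f.read(SECTOR_SIZE)
    if not expected:
        return False  # an empty file cannot be matched meaningfully
    return sector_at_lba(image_path, lba)[:len(expected)] == expected
```

Calling `lba_points_at` with the disk image, the printed LBA, and the original vmlinuz (or cmdline.txt) returns True only if the first sector matches; it says nothing about the rest of the file.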

stapelberg added the bug label Jan 10, 2024
@stapelberg
Contributor Author

I also reproduced the problem within Nix: gcc8 (8.5.0) works:

% cat /tmp/oldgcc.nix
with import <nixpkgs> {};
gcc8Stdenv.mkDerivation {
  name="foo";
  buildInputs = [

    bc
    gcc
    flex
    bison
    openssl
    elfutils
    libelf
    ncurses

  ];
}
% nix-shell /tmp/oldgcc.nix
[…]

And with gcc10 (10.4.0), the kernel fails to boot.

With gcc9 (9.5.0), the kernel boots correctly.

So it seems like the problem is triggered by gcc10+.

@stapelberg
Contributor Author

I read that one of the main changes in gcc 10 is to enable stack protection by default.

Indeed, building the kernel on debian:bullseye, but with CONFIG_STACKPROTECTOR=n makes it boot.

So I suspect that our bootloader does not set up the stack correctly.

I don’t know what the connection to Linux 6.7+ is yet, though.
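
To confirm whether a given build actually had the stack protector enabled, the kernel's .config can be inspected. This is a minimal sketch, assuming the usual .config conventions (enabled options appear as `CONFIG_FOO=y`, disabled ones as `# CONFIG_FOO is not set`); `config_value` is a hypothetical helper, not part of any existing tool.

```python
def config_value(config_text, option):
    """Return the value of `option` in a kernel .config, or None if it is unset."""
    for line in config_text.splitlines():
        line = line.strip()
        if line == f"# {option} is not set":
            return None
        # Match "OPTION=" exactly so that OPTION_STRONG etc. are not confused with OPTION.
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
    return None
```

Reading the .config of the build tree and calling `config_value(text, "CONFIG_STACKPROTECTOR")` should return `"y"` for the failing builds and `None` for the build that boots.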

@stapelberg
Contributor Author

I was wondering why I couldn’t get SeaBIOS debug output to show up in qemu. Turns out that when I don’t use Arch’s qemu 8.1.2, but the qemu 7.2.0 I’m using on router7, I do get SeaBIOS debug output on stdout 🤦 Maybe a bug in newer versions, or the configuration changed.

I attached the working and broken SeaBIOS debug output: qemu-boot.broken.txt, qemu-boot.working.txt

The diff is:

% diff -u /tmp/qemu-boot.working.txt /tmp/qemu-boot.broken.txt
--- /tmp/qemu-boot.working.txt	2024-01-13 08:14:14.705313715 +0100
+++ /tmp/qemu-boot.broken.txt	2024-01-13 08:14:22.355448379 +0100
@@ -1,4 +1,4 @@
-/tmp/qemu/bin/qemu-system-x86_64 -boot order=c,reboot-timeout=5000 -drive file=/tmp/gokr-boot1986672184,format=raw -net nic,macaddr=b8:27:eb:12:34:56 -usb -chardev stdio,id=seabios -device isa-debugcon,iobase=0x402,chardev=seabios      
+/tmp/qemu/bin/qemu-system-x86_64 -boot order=c,reboot-timeout=5000 -drive file=/tmp/gokr-boot1338272335,format=raw -net nic,macaddr=b8:27:eb:12:34:56 -usb -chardev stdio,id=seabios -device isa-debugcon,iobase=0x402,chardev=seabios -s -S 
 qemu-system-x86_64: warning: hub 0 is not connected to host network
 VNC server running on ::1:5900
 SeaBIOS (version rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org)
@@ -106,5 +106,115 @@
   NULL
 Booting from Hard Disk...
 Booting from 0000:7c00
-VBE mode info request: 100
+In resume (status=0)
+In 32bit resume
+Attempting a hard reboot

[…]

@stapelberg
Contributor Author

stapelberg commented Jan 13, 2024

Speaking of working with older software versions, here’s how to start a Docker container with Debian stretch, which contains qemu 2.8, a version in which single-stepping through the MBR works out of the box (bug report regarding more recent versions: https://gitlab.com/qemu-project/qemu/-/issues/141):

% docker run --net=host -v /tmp:/tmp -ti debian:stretch

root@650a2157f663:/# cat > /etc/apt/sources.list <<'EOT'
deb http://archive.debian.org/debian/ stretch contrib main non-free
deb http://archive.debian.org/debian-security/ stretch/updates main
EOT

root@650a2157f663:/# apt update
root@650a2157f663:/# apt install qemu-system-x86
root@650a2157f663:/# qemu-system-i386 -nographic -boot order=c,reboot-timeout=5000 -drive file=/tmp/gokr-boot1338272335,format=raw -net nic,macaddr=b8:27:eb:12:34:56 -usb -s -S

Then, on the host:

% gdb
(gdb) set architecture i8086
(gdb) target remote localhost:1234
(gdb) symbol-file bootloader.elf
(gdb) layout split
(gdb) layout src
(gdb) layout regs
(gdb) b *0x7c00
(gdb) c

We can verify the kernel command line is loaded from cmd_lba to 0x1e000:

(gdb) b read_kernel_bootsector
(gdb) x/s 0x1e000

To understand the program flow, I set up breakpoints at each function of the bootloader:

b read_kernel_setup
b check_version
b read_protected_mode_kernel
b read_protected_mode_kernel_2
b run_kernel
b error
b reboot

The list of functions that are run with the working kernel:

(gdb) b read_kernel_setup
Breakpoint 2 at 0x7c38: file bootloader.asm, line 69.
(gdb) b check_version
Breakpoint 3 at 0x7c56: file bootloader.asm, line 82.
(gdb) b read_protected_mode_kernel
Breakpoint 4 at 0x7c8f: file bootloader.asm, line 99.
(gdb) b read_protected_mode_kernel_2
Breakpoint 5 at 0x7cd6: file bootloader.asm, line 120.
(gdb) b run_kernel
Breakpoint 6 at 0x7cff: file bootloader.asm, line 136.
(gdb) b error
Breakpoint 7 at 0x7d51: file bootloader.asm, line 184.
(gdb) b reboot
Breakpoint 8 at 0x7d62: file bootloader.asm, line 198.
(gdb) c
Continuing.

Breakpoint 2, read_kernel_setup () at bootloader.asm:69
69		xor	eax, eax
(gdb) c
Continuing.

Breakpoint 3, check_version () at bootloader.asm:82
82		cmp	word [es:0x206], 0x204		; we need protocol version >= 2.04
(gdb) c
Continuing.

Breakpoint 4, read_protected_mode_kernel () at bootloader.asm:99
99		mov	edx, [es:0x1f4]			; edx stores the number of bytes to load
(gdb) c
Continuing.

Breakpoint 5, read_protected_mode_kernel_2 () at bootloader.asm:120
120		mov	eax, edx
(gdb) c
Continuing.

Breakpoint 6, run_kernel () at bootloader.asm:136
136		cli			; disable interrupts
(gdb) 

The list of functions that are run with the broken kernel:

(gdb) b read_kernel_setup
Breakpoint 2 at 0x7c38: file bootloader.asm, line 69.
(gdb) b check_version
Breakpoint 3 at 0x7c56: file bootloader.asm, line 82.
(gdb) b read_protected_mode_kernel
Breakpoint 4 at 0x7c8f: file bootloader.asm, line 99.
(gdb) b read_protected_mode_kernel_2
Breakpoint 5 at 0x7cd6: file bootloader.asm, line 120.
(gdb) b run_kernel
Breakpoint 6 at 0x7cff: file bootloader.asm, line 136.
(gdb) b error
Breakpoint 7 at 0x7d51: file bootloader.asm, line 184.
(gdb) b reboot
Breakpoint 8 at 0x7d62: file bootloader.asm, line 198.
(gdb) c
Continuing.

Breakpoint 2, read_kernel_setup () at bootloader.asm:69
69		xor	eax, eax
(gdb) c
Continuing.

Breakpoint 3, check_version () at bootloader.asm:82
82		cmp	word [es:0x206], 0x204		; we need protocol version >= 2.04
(gdb) c
Continuing.

Breakpoint 4, read_protected_mode_kernel () at bootloader.asm:99
99		mov	edx, [es:0x1f4]			; edx stores the number of bytes to load
(gdb) c
Continuing.

Breakpoint 1, ?? () at bootloader.asm:42
42		cli

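So the problem seems to be with loading the kernel from disk: the broken kernel never reaches read_protected_mode_kernel_2. Instead, breakpoint 1 (set at 0x7c00, the start of the bootloader) fires again, which means the machine rebooted during read_protected_mode_kernel.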

@stapelberg
Contributor Author

stapelberg commented Jan 13, 2024

Broken read_protected_mode_kernel:

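  • At line 102, edx (the number of bytes to load) is 15840256.
  • Set a breakpoint with b 102 (read_protected_mode_kernel.loop) and continue a few times.
  • Narrow it down to the failing iteration with cond 10 $edx == 104448.
  • The do_move call in that iteration does not return.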

Theory: extended mode is limited to 15 MB, and with the stack protector enabled, our kernel newly exceeds 15 MB of data to copy.
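
The numbers from the debugging session can be checked against this theory with a little arithmetic. This is only a sketch under stated assumptions: it treats the initial edx value as the total byte count, the edx value at the failing iteration as the bytes still outstanding, and it reads "15 MB" as 15 MiB.

```python
MIB = 1024 * 1024

total_bytes = 15840256  # edx at line 102 (broken kernel)
remaining = 104448      # edx when the do_move call stops returning
copied = total_bytes - remaining  # bytes handled before the failing call

print(total_bytes / MIB)   # kernel size in MiB, slightly above 15
print(copied - 15 * MIB)   # → 7168
```

Under these assumptions the kernel is about 15.1 MiB, and the failing call happens right after the 15 MiB mark has been crossed. That is consistent with the theory, but it does not prove it.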

I had previously tried padding the kernel to figure out if the size plays a role, but that happened at the wrong level: I padded the vmlinuz file in the FAT file system, but the relevant size is determined by the kernel header data structure, which contains the number of bytes the bootloader will copy.
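
The size the bootloader uses can be read directly from the kernel image. The following is a minimal sketch, assuming the field layout documented in the Linux x86 boot protocol (setup_sects at 0x1f1, syssize at 0x1f4 counted in 16-byte units, the "HdrS" magic at 0x202, the protocol version at 0x206); `read_boot_header` is a hypothetical helper.

```python
import struct


def read_boot_header(image):
    """Extract the size-related fields from the start of an x86 bzImage."""
    if image[0x202:0x206] != b"HdrS":
        raise ValueError("not a Linux x86 boot image (missing HdrS magic)")
    setup_sects = image[0x1F1]
    (syssize,) = struct.unpack_from("<I", image, 0x1F4)
    (version,) = struct.unpack_from("<H", image, 0x206)
    return {
        "setup_sects": setup_sects,
        "protocol_version": version,
        # syssize counts 16-byte paragraphs of protected-mode code
        "protected_mode_bytes": syssize * 16,
    }
```

Passing the first 1024 bytes of a vmlinuz file is enough. Comparing `protected_mode_bytes` between the working and the broken kernel shows the size difference that padding the file in the FAT file system cannot reproduce.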


Here are some resources I found helpful:

stapelberg pushed a commit to rtr7/kernel that referenced this issue Jan 13, 2024
@stapelberg
Contributor Author

stapelberg commented Feb 11, 2024

I wrote a blog post about this failure: https://michael.stapelberg.ch/posts/2024-02-11-minimal-linux-bootloader-debugging-story/

I’ll close this issue in favor of a tracking bug in the gokrazy repository about longer-term MBR bootloader changes: gokrazy/gokrazy#248
