I maintain two builds of the Linux kernel, a linux/arm64 build for gokrazy,
my Go appliance platform, which started out on the
Raspberry Pi, and then a linux/amd64 one for router7,
which runs on PCs.
The update process for both of these builds is entirely automated, meaning new
Linux kernel releases are automatically tested and merged, but recently the
continuous integration testing failed to automatically merge Linux
6.7. This article is about tracking
down the root cause of that failure.
Background info on the bootloader
gokrazy started out targeting only the Raspberry Pi, where you configure the
bootloader with a plain text file on a FAT partition, so we did not need to
include our own UEFI/MBR bootloader.
When I ported gokrazy to work on PCs in BIOS mode, I decided against complicated
solutions like GRUB — I really wasn’t looking to maintain a GRUB package. Just
keeping GRUB installations working on my machines is enough work. The fact that
GRUB consists of many different files (modules) that can go out of sync really
does not appeal to me.
For UEFI, there is systemd-boot,
which comes as a single-file UEFI program, easy to include. That’s how gokrazy
supports UEFI boot. Unfortunately, the PC Engines apu2c4 does not support UEFI,
so I also needed an MBR solution.
Instead of GRUB, I went with Sebastian Plotz’s Minimal Linux
Bootloader because it fits
entirely into the Master Boot Record
(MBR) and does not require
any files. In bootloader lingo, this is a stage1-only bootloader. You don’t even
need a C compiler to compile its (Assembly) code. It seemed simple enough to
integrate: just write the bootloader code into the first sector of the gokrazy
disk image; done. The bootloader had its last release in 2012, so no need for
updates or maintenance.
You can’t really implement booting a kernel and parsing text configuration
files in 446
bytes of 16-bit
8086 assembly instructions, so to tell the bootloader where on disk to load the
kernel code and kernel command line from, gokrazy writes the disk offset
(LBA) of vmlinuz and
cmdline.txt to the last bytes of the bootloader code. Because gokrazy
generates the FAT partition, we know there is never any fragmentation, so the
bootloader does not need to understand the FAT file system.
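To make the mechanism concrete, here is a minimal Python sketch of patching two sector offsets into the tail of the 446-byte bootloader code. The function name, the field order and the exact offsets (the last 8 bytes, two 32-bit little-endian values) are assumptions for illustration; gokrazy’s actual layout may differ.

```python
import struct

BOOT_CODE_SIZE = 446  # bytes available for code in a classic MBR


def patch_lbas(boot_code: bytes, vmlinuz_lba: int, cmdline_lba: int) -> bytes:
    """Return boot_code with two LBAs written into its last 8 bytes.

    Hypothetical layout: [..., vmlinuz LBA (u32 LE), cmdline.txt LBA (u32 LE)].
    """
    if len(boot_code) != BOOT_CODE_SIZE:
        raise ValueError(f"expected {BOOT_CODE_SIZE} bytes, got {len(boot_code)}")
    tail = struct.pack("<II", vmlinuz_lba, cmdline_lba)
    return boot_code[:-len(tail)] + tail


# Example with a dummy bootloader consisting only of NOP (0x90) bytes.
# The LBA values are made up.
patched = patch_lbas(b"\x90" * BOOT_CODE_SIZE, vmlinuz_lba=8218, cmdline_lba=8210)
print(patched[-8:].hex())
```

Because the offsets are fixed at image build time, the bootloader itself only needs to read sectors starting at those LBAs.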
Symptom
The symptom was that the rtr7/kernel pull request
#434 for updating to Linux 6.7 failed.
My continuous integration tests run in two environments: a physical embedded PC
from PC Engines (apu2c4) in my living room, and a
virtual QEMU PC. Only the QEMU test failed.
On the physical PC Engines apu2c4, the pull request actually passed the boot
test. It would be wrong to draw conclusions like “the issue only affects QEMU”
from this, though, as later attempts to power on the apu2c4 showed the device
boot-looping. I made a mental note that something is different about how the
problem affects the two environments, but both are affected, and decided to
address the failure in QEMU first, then think about the PC Engines failure some
more.
Later in the investigation I found out that this was because the
physical continuous integration setup didn’t disable kexec
yet, so it wasn’t actually
exercising BIOS boot via the Master Boot Record.
In QEMU, the kernel doesn’t even seem to start: no “Decompressing linux” message is printed, the boot just hangs. I tried enabling debug output in SeaBIOS and eventually succeeded, but only with an older QEMU version:
Booting from Hard Disk...
Booting from 0000:7c00
In resume (status=0)
In 32bit resume
Attempting a hard reboot
This doesn’t tell me anything unfortunately.
Okay, so something about introducing Linux 6.7 into my setup breaks MBR boot.
I figured using Git Bisection
should identify the problematic change within a few iterations, so I cloned the
currently working Linux 6.6 source code, applied the router7 config and compiled
it.
To my surprise, even my self-built Linux 6.6 kernel would not boot! 😲
Why does the router7 build work when built inside the Docker container, but not
when built on my Linux installation? I decided to rebase the Docker container
from Debian 10 (buster, from 2019) to Debian 12 (bookworm, from 2023) and that
resulted in a non-booting kernel, too!
We have two triggers: building Linux 6.7 or building older Linux, but in newer
environments.
First, check out the rtr7/kernel repository and undo the mitigation:
% mkdir -p go/src/github.com/rtr7/
% cd go/src/github.com/rtr7/
% git clone --depth=1 https://github.com/rtr7/kernel
% cd kernel
% sed -i 's,CONFIG_KERNEL_ZSTD,#CONFIG_KERNEL_ZSTD,g' cmd/rtr7-build-kernel/config.addendum.txt
% go run ./cmd/rtr7-rebuild-kernel
# takes a few minutes to compile Linux
% ls -l vmlinuz
-rw-r--r-- 1 michael michael 15885312 2024-01-28 16:18 vmlinuz
Now, you can either create a new gokrazy instance, replace the kernel and
configure the gokrazy instance to use rtr7/kernel:
…or you skip these steps and extract my already prepared config to ~/gokrazy/mbr.
Then, build the gokrazy disk image and start it with QEMU.
Up/Downgrade Versions
Unlike application programs, the Linux kernel doesn’t depend on shared libraries
at runtime, so the dependency footprint is a little smaller than usual. The most
significant dependencies are the components of the build environment, like the C
compiler or the linker.
So let’s look at the software versions of the known-working (Debian 10)
environment and the smallest change we can make to that (upgrading to Debian
11):
Debian 10 (buster) contains gcc-8 (8.3.0-6) and binutils 2.31.1-16.
Debian 11 (bullseye) contains gcc-10 (10.2.1-6) and binutils 2.35.2-2.
To figure out if the problem is triggered by GCC, binutils, or something else
entirely, I checked:
Debian 10 (buster) with its gcc-8, but with binutils 2.35 from bullseye
still works. (Checked by updating /etc/apt/sources.list, then upgrading only
the binutils package.)
Debian 10 (buster), but with gcc-10 and binutils 2.35 results in a
non-booting kernel.
So it seems like upgrading from GCC 8 to GCC 10 triggers the issue.
Instead of working with a Docker container and Debian’s packages, you could also
use Nix. The instructions
aren’t easy, but I used
nix-shell
to quickly try out GCC 8 (works), GCC 9 (works) and GCC 10 (kernel doesn’t boot)
on my machine.
New Hypothesis
To recap, we have two triggers: building Linux 6.7 or building older Linux, but
with GCC 10.
Two theories seemed most plausible to me at this point: Either a change in GCC
10 (possibly enabled by another change in Linux 6.7) is the problem, or the size
of the kernel is the problem.
To verify the file size hypothesis, I padded a known-working vmlinuz file to
the size of a known-broken vmlinuz:
But, even though it had the same file size as the known-broken kernel, the
padded kernel booted!
So I ruled out kernel size as a problem and started researching significant
changes in GCC 10.
This is an incorrect conclusion! The mistake I made here was that I padded the
kernel on the file level, but the boot loader ignores the file system entirely
and takes the size from the kernel header, which I did not update.
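A small Python sketch shows why the padding experiment could not work. It assumes the x86 Linux boot protocol’s syssize header field (a 32-bit little-endian value at offset 0x1f4, counted in 16-byte units), which is the field the bootloader reads. The helper names and the synthetic image are made up for illustration.

```python
import struct

SYSSIZE_OFFSET = 0x1F4  # 32-bit little-endian, counted in 16-byte units


def protected_mode_size(image: bytes) -> int:
    """Number of bytes a header-driven bootloader would load, per syssize."""
    (paragraphs,) = struct.unpack_from("<I", image, SYSSIZE_OFFSET)
    return paragraphs * 16


def pad_file(image: bytes, target_size: int) -> bytes:
    """Pad on the file level only, as in the experiment above."""
    return image + b"\x00" * max(0, target_size - len(image))


# Synthetic stand-in for a vmlinuz: 1 KiB of header space followed by payload.
payload_len = 4096
image = bytearray(1024 + payload_len)
struct.pack_into("<I", image, SYSSIZE_OFFSET, payload_len // 16)

padded = pad_file(bytes(image), 2 * len(image))
print(len(image), len(padded))                                   # file sizes differ
print(protected_mode_size(image), protected_mode_size(padded))   # header size is the same
```

Padding changes the file size, but the size recorded in the header stays the same, so the bootloader loads exactly as many bytes as before.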
I read that GCC 10 changed behavior with regard to stack protection.
Indeed, building the kernel with Debian 11 (bullseye), but with
CONFIG_STACKPROTECTOR=n makes it boot. So, I suspected that our bootloader
does not set up the stack correctly, or similar.
I sent an email to Sebastian Plotz, the author of the Minimal Linux Bootloader,
to ask if he knew about any issues with his bootloader, or if stack protection
seems like a likely issue with his bootloader to him.
To my surprise (it has been over 10 years since he published the bootloader!) he
actually replied: He hadn’t received any problem reports regarding his
bootloader, but didn’t really understand how stack protection would be related.
Debugging with QEMU
At this point, we have isolated at least one trigger for the problem, and
exhausted the easy techniques of upgrading/downgrading surrounding software
versions and asking upstream.
It’s time for a Tooling Level Up! Without a debugger you can only poke into
the dark, which takes time and doesn’t result in thorough
explanations. Particularly in this case, I think it is very likely that any
source modifications could have introduced subtle issues. So let’s reach for a
debugger!
Luckily, QEMU comes with built-in support for the GDB debugger. Just add the -s -S flags to your QEMU command to make QEMU stop execution (-S) and set up a
GDB stub (-s) listening on localhost:1234.
If you wanted to debug the Linux kernel, you could connect GDB to QEMU right
away, but for debugging a boot loader we need an extra step, because the boot
loader runs in Real Mode, but QEMU’s
GDB integration rightfully defaults to the more modern Protected Mode.
When GDB is not configured correctly, it decodes addresses and registers with
the wrong size, which throws off the entire disassembly — compare GDB’s
output with our assembly source:
So we need to ensure we use qemu-system-i386 (qemu-system-x86_64 prints Remote 'g' packet reply is too long) and configure the GDB target architecture to 16-bit 8086 using set architecture i8086.
Unfortunately, this doesn’t actually work in QEMU 2.9 and newer: https://gitlab.com/qemu-project/qemu/-/issues/141.
On the web, people are working around this bug by using a modified target.xml
file. I
tried this, but must have made a mistake — I thought modifying target.xml
didn’t help, but when I wrote this article, I found that it does actually seem
to work. Maybe I didn’t use qemu-system-i386 but the x86_64 variant or
something like that.
Using an older QEMU
It is typically an exercise in frustration to get older software to compile in newer environments.
It’s much easier to use an older environment to run old software.
By querying packages.debian.org, we can see the QEMU versions included in current and previous Debian versions.
Unfortunately, the oldest listed version (QEMU 3.1 in Debian 10 (buster)) isn’t
old enough. By querying snapshot.debian.org, we can see that Debian 9
(stretch) contained QEMU
2.8.
So let’s run Debian 9 — the easiest way I know is to use Docker:
% docker run --net=host -v /tmp:/tmp -ti debian:stretch
Unfortunately, the debian:stretch Docker container does not work out of the
box anymore, because its /etc/apt/sources.list points to the deb.debian.org
CDN, which only serves current versions and no longer serves stretch.
So we need to update the sources.list file to point to
archive.debian.org. To correctly install QEMU you need both entries, the
debian line and the debian-security line, because the Docker container has
packages from debian-security installed and gets confused when these are
missing from the package list:
root@650a2157f663:/# cat > /etc/apt/sources.list <<'EOT'
deb http://archive.debian.org/debian/ stretch contrib main non-free
deb http://archive.debian.org/debian-security/ stretch/updates main
EOT
root@650a2157f663:/# apt update
Now we can just install QEMU as usual and start it to debug our boot process:
Now let’s start GDB and set a breakpoint on address 0x7c00, which is the address to which the BIOS loads the MBR code and starts execution:
% gdb
(gdb) set architecture i8086
The target architecture is set to "i8086".
(gdb) target remote localhost:1234
Remote debugging using localhost:1234
0x0000fff0 in ?? ()
(gdb) break *0x7c00
Breakpoint 1 at 0x7c00
(gdb) continue
Continuing.
Breakpoint 1, 0x00007c00 in ?? ()
(gdb)
Debug symbols
Okay, so we have GDB attached to QEMU and can step through assembly
instructions. Let’s start debugging!?
Not so fast. There is another Tooling Level Up we need first: debug
symbols. Yes, even for a Minimal Linux Bootloader, which doesn’t use any
libraries or local variables. Having proper names for functions, as well as line
numbers, will be hugely helpful in just a second.
Before debug symbols, I would directly build the bootloader using nasm bootloader.asm, but to end up with a symbol file for GDB, we need to instruct
nasm to generate an ELF file with debug symbols, then use ld to link it and
finally use objcopy to copy the code out of the ELF file again.
After commit
d29c615
in gokrazy/internal/mbr, I have bootloader.elf.
Back in GDB, we can load the symbols using the symbol-file command:
(gdb) set architecture i8086
The target architecture is set to "i8086".
(gdb) target remote localhost:1234
Remote debugging using localhost:1234
0x0000fff0 in ?? ()
(gdb) symbol-file bootloader.elf
Reading symbols from bootloader.elf...
(gdb) break *0x7c00
Breakpoint 1 at 0x7c00: file bootloader.asm, line 48.
(gdb) continue
Continuing.
Breakpoint 1, ?? () at bootloader.asm:48
48 cli
(gdb)
Automation with .gdbinit
At this point, we need 4 commands each time we start GDB. We can automate these
by writing them to a .gdbinit file:
set architecture i8086
target remote localhost:1234
symbol-file bootloader.elf
break *0x7c00
% gdb
The target architecture is set to "i8086".
0x0000fff0 in ?? ()
Breakpoint 1 at 0x7c00: file bootloader.asm, line 48.
(gdb)
Understanding program flow
The easiest way to understand program flow seems to be to step through the program.
But Minimal Linux Bootloader (MLB) contains loops that run through thousands of
iterations. You can’t use gdb’s stepi command with that.
Because MLB only contains a few functions, I eventually realized that placing a
breakpoint on each function would be the quickest way to understand the
high-level program flow:
(gdb) b read_kernel_setup
Breakpoint 2 at 0x7c38: file bootloader.asm, line 75.
(gdb) b check_version
Breakpoint 3 at 0x7c56: file bootloader.asm, line 88.
(gdb) b read_protected_mode_kernel
Breakpoint 4 at 0x7c8f: file bootloader.asm, line 105.
(gdb) b read_protected_mode_kernel_2
Breakpoint 5 at 0x7cd6: file bootloader.asm, line 126.
(gdb) b run_kernel
Breakpoint 6 at 0x7cff: file bootloader.asm, line 142.
(gdb) b error
Breakpoint 7 at 0x7d51: file bootloader.asm, line 190.
(gdb) b reboot
Breakpoint 8 at 0x7d62: file bootloader.asm, line 204.
With the non-booting kernel, we get the following transcript:
Breakpoint 3, check_version () at bootloader.asm:88
88 cmp word [es:0x206], 0x204 ; we need protocol version >= 2.04
(gdb)
Continuing.
Breakpoint 4, read_protected_mode_kernel () at bootloader.asm:105
105 mov edx, [es:0x1f4] ; edx stores the number of bytes to load
(gdb)
Continuing.
Breakpoint 1, ?? () at bootloader.asm:48
48 cli
(gdb)
Okay! Now we see that the bootloader starts loading the kernel from disk into
RAM, but doesn’t actually get far enough to call run_kernel, meaning the
problem isn’t with stack protection, with loading a working command line or with
anything inside the Linux kernel.
This lets us rule out a large part of the problem space. We now know that we can
focus entirely on the bootloader and why it cannot load the Linux kernel into
memory.
Let’s take a closer look…
Wait, this isn’t GDB!
In the example above, using breakpoints was sufficient to narrow down the problem.
You might think we used GDB, and it looked like this:
But that’s not GDB! It’s an easy mistake to make. After all, GDB starts up with
just a text prompt, and as you can see from the example above, we can just enter
text and achieve a good result.
To see the real GDB, you need to start it up fully, meaning including its user
interface.
You can either use GDB’s text user interface (TUI), or a graphical user
interface for gdb, such as the one available in Emacs.
The GDB text-mode user interface (TUI)
You’re already familiar with the architecture, target and breakpoint
commands from above. To also set up the text-mode user interface, we run a few
layout commands:
The layout split command loads the text-mode user interface and splits the
screen into a register window, disassembly window and command window.
With layout src we disregard the disassembly window in favor of a source
listing window. Both are in assembly language in our case, but the source
listing contains comments as well.
The layout src command also got rid of the register window, which we’ll get
back using layout regs. I’m not sure if there’s an easier way.
The result looks like this:
The source window will highlight the next line of code that will be executed. On
the left, the B+ marker indicates an enabled breakpoint, which will become
helpful with multiple breakpoints. Whenever a register value changes, the
register and its new value will be highlighted.
The up and down arrow keys scroll the source window.
Use C-x o to switch between the windows.
If you’re familiar with Emacs, you’ll recognize the keyboard shortcut. But as an
Emacs user, you might prefer the GDB Emacs user interface:
Let’s take a look at the loop that we know the bootloader is entering, but not
leaving (neither read_protected_mode_kernel_2 nor run_kernel is ever called):
read_protected_mode_kernel:
    mov edx, [es:0x1f4]              ; edx stores the number of bytes to load
    shl edx, 4
.loop:
    cmp edx, 0
    je run_kernel
    cmp edx, 0xfe00                  ; less than 127*512 bytes remaining?
    jb read_protected_mode_kernel_2
    mov eax, 0x7f                    ; load 127 sectors (maximum)
    xor bx, bx                       ; no offset
    mov cx, 0x2000                   ; load temporary to 0x20000
    mov esi, current_lba
    call read_from_hdd
    mov cx, 0x7f00                   ; move 65024 bytes (127*512 byte)
    call do_move
    sub edx, 0xfe00                  ; update the number of bytes to load
    add word [gdt.dest], 0xfe00
    adc byte [gdt.dest+2], 0
    jmp short read_protected_mode_kernel.loop
The comments explain that the code loads chunks of FE00h == 65024 (127*512)
bytes at a time.
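As a sketch of the loop’s bookkeeping (not the bootloader’s actual code, which also performs the disk reads and memory moves), the following Python function counts how many full 0xfe00-byte chunks the loop copies before the remainder is handed to read_protected_mode_kernel_2. The function name and the example syssize value are made up.

```python
CHUNK = 127 * 512  # 0xfe00 bytes, the maximum transfer per iteration


def plan_chunks(syssize_paragraphs):
    """Mirror the loop's counting: returns (full_chunks, remaining_bytes)."""
    remaining = syssize_paragraphs << 4  # shl edx, 4: 16-byte units to bytes
    full_chunks = 0
    while remaining >= CHUNK:            # otherwise: jb read_protected_mode_kernel_2
        remaining -= CHUNK               # sub edx, 0xfe00
        full_chunks += 1
    return full_chunks, remaining


# Made-up example: a protected-mode kernel of 0xF0000 paragraphs (15 MiB).
print(plan_chunks(0xF0000))
```

A kernel of roughly 15 MB therefore needs a few hundred iterations, which is why single-stepping with stepi is impractical.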
Loading means calling read_from_hdd, then do_move. Let’s take a look at do_move:
do_move:
    push edx
    push es
    xor ax, ax
    mov es, ax
    mov ah, 0x87
    mov si, gdt
    int 0x15                         ; line 182
    jc error
    pop es
    pop edx
    ret
int 0x15 is a call to the BIOS Service Interrupt, which will dispatch the call
based on AH == 87H to the Move Memory Block
(techhelpmanual.com)
function.
This function moves the specified amount of memory (65024 bytes in our case)
between the source and destination addresses specified in a Global Descriptor
Table (GDT) record.
We can use GDB to show the addresses of each of do_move’s memory move calls by
telling it to stop at line 182 (the int 0x15 instruction) and print the GDT
record’s destination descriptor:
(gdb) break 182
Breakpoint 2 at 0x7d49: file bootloader.asm, line 176.
(gdb) command 2
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>x/8bx gdt+24
>end
(gdb) continue
Continuing.
Breakpoint 1, ?? () at bootloader.asm:48
42 cli
(gdb)
Continuing.
Breakpoint 2, do_move () at bootloader.asm:182
182 int 0x15
0x7d85: 0xff 0xff 0x00 0x00 0x10 0x93 0x00 0x00
(gdb)
Continuing.
Breakpoint 2, do_move () at bootloader.asm:182
182 int 0x15
0x7d85: 0xff 0xff 0x00 0xfe 0x10 0x93 0x00 0x00
(gdb)
The destination address is stored in bytes 2..4. Remember to read these
little-endian entries “back to front”.
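To avoid decoding the bytes by hand, a small Python helper can do it. This sketch assumes the standard x86 segment descriptor layout (limit in bytes 0..1, base bits 0..23 in bytes 2..4, access byte 5, base bits 24..31 in byte 7); the function name is made up.

```python
def descriptor_base(desc: bytes, use_high_byte: bool = False) -> int:
    """Decode the base address from an 8-byte x86 segment descriptor."""
    if len(desc) != 8:
        raise ValueError("a segment descriptor is 8 bytes long")
    base = int.from_bytes(desc[2:5], "little")  # bits 0..23
    if use_high_byte:
        base |= desc[7] << 24                   # bits 24..31 (386 and newer)
    return base


# The two descriptors printed by GDB above.
first = bytes([0xFF, 0xFF, 0x00, 0x00, 0x10, 0x93, 0x00, 0x00])
second = bytes([0xFF, 0xFF, 0x00, 0xFE, 0x10, 0x93, 0x00, 0x00])
print(hex(descriptor_base(first)), hex(descriptor_base(second)))
```

The destination starts at 0x100000 (1 MiB) and advances by 0xfe00 with each chunk.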
If we press Return long enough, we eventually end up here:
Breakpoint 2, do_move () at bootloader.asm:182
182 int 0x15
0x7d85: 0xff 0xff 0x00 0x1e 0xff 0x93 0x00 0x00
(gdb)
Continuing.
Breakpoint 2, do_move () at bootloader.asm:182
182 int 0x15
0x7d85: 0xff 0xff 0x00 0x1c 0x00 0x93 0x00 0x00
(gdb)
Continuing.
Breakpoint 1, ?? () at bootloader.asm:48
42 cli
(gdb)
Program received signal SIGTRAP, Trace/breakpoint trap.
0x000079b0 in ?? ()
(gdb)
Now that execution has left the bootloader, let’s take a look at the last do_move
call parameters: We notice that the destination address overflowed its 24 bit
data type:
Address #y is 0xff1e00
Address #z is 0x001c00
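The wraparound can be reproduced with plain arithmetic. The sketch below assumes the start address 0x100000 seen in the first descriptor and a destination that is truncated to 24 bits; the function name is made up.

```python
def find_wrap(base=0x100000, chunk=0xFE00, bits=24):
    """Return (chunks_before_wrap, last_fitting_dest, wrapped_dest)."""
    limit = 1 << bits
    dest = base
    chunks = 0
    # Advance as long as the next destination still fits into `bits` bits.
    while dest + chunk < limit:
        dest += chunk
        chunks += 1
    return chunks, dest, (dest + chunk) % limit


chunks, last_dest, wrapped = find_wrap()
print(chunks, hex(last_dest), hex(wrapped))
```

The last destination that fits and the wrapped one match the two addresses GDB showed, so the loop starts overwriting low memory once the destination would cross the 16 MiB boundary.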
Root cause
At this point I reached out to Sebastian again to ask him if there was an
(undocumented) fundamental architectural limit to his Minimal Linux Bootloader —
with 24 bit addresses, you can address at most 16 MB of memory.
So, is it impossible to load larger kernels into memory from Real Mode? I’m not
sure.
The current bootloader code prepares a GDT in which addresses are 24 bits long
at most. But note that the techhelpmanual.com documentation that Sebastian
referenced is apparently for the Intel
286 (a 16 bit CPU), and some of the
GDT bytes are declared reserved.
Today’s CPUs are Intel 386-compatible (a
32 bit CPU), which seems to use one of the formerly reserved bytes to represent
bits 24..31 of the address, meaning we might be able to pass 32 bit addresses
to BIOS functions in a GDT after all!
I wasn’t able to find clear authoritative documentation on the Move Memory Block
API on 386+, or whether BIOS functions in general are just expected to work with 32 bit addresses.
Hence I’m thinking that most BIOS implementations should actually support 32
bit addresses for their Move Memory Block implementation — provided you fill the
descriptor accordingly.
Lobsters reader abbeyj pointed
out
that the following code change should fix the truncation and result in a GDT
with all address bits in the right place:
--- i/mbr/bootloader.asm
+++ w/mbr/bootloader.asm
@@ -119,6 +119,7 @@ read_protected_mode_kernel:
sub edx, 0xfe00 ; update the number of bytes to load
add word [gdt.dest], 0xfe00
adc byte [gdt.dest+2], 0
+ adc byte [gdt.dest+5], 0
jmp short read_protected_mode_kernel.loop
read_protected_mode_kernel_2:
…and indeed, in my first test this seems to fix the problem! It’ll take me a
little while to clean this up and submit it. You can follow gokrazy issue
#248 if you’re interested.
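To see why the extra adc instruction helps, here is a Python model of the carry chain. It assumes that gdt.dest points at byte 2 of the destination descriptor, so that gdt.dest+2 is descriptor byte 4 and gdt.dest+5 is descriptor byte 7 (base bits 24..31 in the 386 descriptor format). The function names and the patched flag are made up for illustration.

```python
def advance_dest(desc: bytearray, patched: bool, step: int = 0xFE00) -> None:
    """Advance the base address in an 8-byte descriptor like the loop does."""
    low = int.from_bytes(desc[2:4], "little") + step   # add word [gdt.dest], 0xfe00
    desc[2:4] = (low & 0xFFFF).to_bytes(2, "little")
    mid = desc[4] + (low >> 16)                        # adc byte [gdt.dest+2], 0
    desc[4] = mid & 0xFF
    if patched:
        desc[7] = (desc[7] + (mid >> 8)) & 0xFF        # adc byte [gdt.dest+5], 0


def base_of(desc: bytearray) -> int:
    """Full 32-bit base address: bytes 2..4 plus byte 7 as bits 24..31."""
    return int.from_bytes(desc[2:5], "little") | (desc[7] << 24)


# Start from the last descriptor that GDB showed before the wraparound.
last_good = bytes([0xFF, 0xFF, 0x00, 0x1E, 0xFF, 0x93, 0x00, 0x00])

original = bytearray(last_good)
advance_dest(original, patched=False)

fixed = bytearray(last_good)
advance_dest(fixed, patched=True)

print(hex(base_of(original)), hex(base_of(fixed)))
```

Without the third instruction, the carry out of bit 23 is dropped and the address wraps to low memory; with it, the carry lands in the descriptor’s high base byte.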
Bonus: reading BIOS source
There are actually a couple of BIOS implementations that we can look into to get
a better understanding of how Move Memory Block works.
One of them is DOSBox, an open source DOS emulator. Its Move Memory Block implementation combines the low 24 bits of the destination address with a separate high byte:
PhysPt dest = (mem_readd(data+0x1A) &0x00FFFFFF) + (mem_readb(data+0x1E)<<24);
Another implementation is SeaBIOS. Contrary
to DOSBox, SeaBIOS is not just used in emulation: The PC Engines apu uses
coreboot with SeaBIOS. QEMU also uses SeaBIOS.
The SeaBIOS handle_1587 source
code
is a little harder to follow, because it requires knowledge of Real Mode
assembly. The way I read it, SeaBIOS doesn’t truncate or otherwise modify the
descriptors and just passes them to the CPU. On 386 or newer, 32 bit addresses
should work.
Mitigation
While it’s great to understand the limitation we’re running into, I wanted to
unblock the pull request as quickly as possible, so I needed a quick mitigation
instead of investigating if my speculation can be developed into
a proper fix.
When I started router7, we didn’t support loadable kernel modules, so everything
had to be compiled into the kernel. We now do support loadable kernel modules,
so I could have moved functionality into modules.
Instead, I found an even easier quick fix: switching from gzip to zstd
compression. This
saved about 1.8 MB and will buy us some time to implement a proper fix while
unblocking automated new Linux kernel version merges.
Conclusion
I wanted to share this debugging story because it shows a couple of interesting lessons:
Being able to run older versions of various parts of your software stack is a
very valuable debugging tool. It helped us isolate a trigger for the bug
(using an older GCC) and it helped us set up a debugging environment (using
an older QEMU).
Setting up a debugger can be annoying (symbol files, learning the UI) but
it’s so worth it.
Be on the lookout for wrong turns during debugging. Write down every
conclusion and challenge it.
The BIOS can seem mysterious and “too low level” but there are many blog
posts, lectures and tutorials. You can also just read open-source BIOS code
to understand it much better.
Enjoy poking at your BIOS!
Minimal Linux Bootloader debugging story 🐞
https://ift.tt/yYNn0iU
Michael Stapelberg
I maintain two builds of the Linux kernel, a
linux/arm64
build for gokrazy, my Go appliance platform, which started out on the Raspberry Pi, and then alinux/amd64
one for router7, which runs on PCs.The update process for both of these builds is entirely automated, meaning new Linux kernel releases are automatically tested and merged, but recently the continuous integration testing failed to automatically merge Linux 6․7 — this article is about tracking down the root cause of that failure.
Background info on the bootloader
gokrazy started out targeting only the Raspberry Pi, where you configure the bootloader with a plain text file on a FAT partition, so we did not need to include our own UEFI/MBR bootloader.
When I ported gokrazy to work on PCs in BIOS mode, I decided against complicated solutions like GRUB — I really wasn’t looking to maintain a GRUB package. Just keeping GRUB installations working on my machines is enough work. The fact that GRUB consists of many different files (modules) that can go out of sync really does not appeal to me.
For UEFI, there is systemd-boot, which comes as a single-file UEFI program, easy to include. That’s how gokrazy supports UEFI boot. Unfortunately, the PC Engines apu2c4 does not support UEFI, so I also needed an MBR solution.
Instead, I went with Sebastian Plotz’s Minimal Linux Bootloader because it fits entirely into the Master Boot Record (MBR) and does not require any files. In bootloader lingo, this is a stage1-only bootloader. You don’t even need a C compiler to compile its (Assembly) code. It seemed simple enough to integrate: just write the bootloader code into the first sector of the gokrazy disk image; done. The bootloader had its last release in 2012, so no need for updates or maintenance.
You can’t really implement booting a kernel and parsing text configuration files in 446 bytes of 16-bit 8086 assembly instructions, so to tell the bootloader where on disk to load the kernel code and kernel command line from, gokrazy writes the disk offset (LBA) of
vmlinuz
andcmdline.txt
to the last bytes of the bootloader code. Because gokrazy generates the FAT partition, we know there is never any fragmentation, so the bootloader does not need to understand the FAT file system.Symptom
The symptom was that the
rtr7/kernel
pull request #434 for updating to Linux 6.7 failed.My continuous integration tests run in two environments: a physical embedded PC from PC Engines (apu2c4) in my living room, and a virtual QEMU PC. Only the QEMU test failed.
On the physical PC Engines apu2c4, the pull request actually passed the boot test. It would be wrong to draw conclusions like “the issue only affects QEMU” from this, though, as later attempts to power on the apu2c4 showed the device boot-looping. I made a mental note that something is different about how the problem affects the two environments, but both are affected, and decided to address the failure in QEMU first, then think about the PC Engines failure some more.
Later in the investigation I found out that this was because the physical continuous integration setup didn’t disable kexec yet, so it wasn’t actually exercising BIOS boot via the Master Boot Record.
In QEMU, the output I see is:
Notably, the kernel doesn’t even seem to start — no “Decompressing linux” message is printed, the boot just hangs. I tried enabling debug output in SeaBIOS and eventually succeeded, but only with an older QEMU version:
This doesn’t tell me anything unfortunately.
Okay, so something about introducing Linux 6.7 into my setup breaks MBR boot.
I figured using Git Bisection should identify the problematic change within a few iterations, so I cloned the currently working Linux 6.6 source code, applied the router7 config and compiled it.
To my surprise, even my self-built Linux 6.6 kernel would not boot! 😲
Why does the router7 build work when built inside the Docker container, but not when built on my Linux installation? I decided to rebase the Docker container from Debian 10 (buster, from 2019) to Debian 12 (bookworm, from 2023) and that resulted in a non-booting kernel, too!
We have two triggers: building Linux 6.7 or building older Linux, but in newer environments.
First, check out the
rtr7/kernel
repository and undo the mitigation:Now, you can either create a new gokrazy instance, replace the kernel and configure the gokrazy instance to use
rtr7/kernel
:…or you skip these steps and extract my already prepared config to
~/gokrazy/mbr
.Then, build the gokrazy disk image and start it with QEMU:
Up/Downgrade Versions
Unlike application programs, the Linux kernel doesn’t depend on shared libraries at runtime, so the dependency footprint is a little smaller than usual. The most significant dependencies are the components of the build environment, like the C compiler or the linker.
So let’s look at the software versions of the known-working (Debian 10) environment and the smallest change we can make to that (upgrading to Debian 11):
To figure out if the problem is triggered by GCC, binutils, or something else entirely, I checked:
Debian 10 (buster) with its
gcc-8
, but withbinutils 2.35
from bullseye still works. (Checked by updating/etc/apt/sources.list
, then upgrading only thebinutils
package.)Debian 10 (buster), but with
gcc-10
andbinutils 2.35
results in a non-booting kernel.So it seems like upgrading from GCC 8 to GCC 10 triggers the issue.
Instead of working with a Docker container and Debian’s packages, you could also use Nix. The instructions aren’t easy, but I used
nix-shell
to quickly try out GCC 8 (works), GCC 9 (works) and GCC 10 (kernel doesn’t boot) on my machine.New Hypothesis
To recap, we have two triggers: building Linux 6.7 or building older Linux, but with GCC 10.
Two theories seemed most plausible to me at this point: Either a change in GCC 10 (possibly enabled by another change in Linux 6.7) is the problem, or the size of the kernel is the problem.
To verify the file size hypothesis, I padded a known-working
vmlinuz
file to the size of a known-brokenvmlinuz
:But, even though it had the same file size as the known-broken kernel, the padded kernel booted!
So I ruled out kernel size as a problem and started researching significant changes in GCC 10.
This is an incorrect conclusion! The mistake I made here was that I padded the kernel on the file level, but the boot loader ignores the file system entirely and takes the size from the kernel header, which I did not update.
I read that GCC 10 changed behavior with regards to stack protection.
Indeed, building the kernel with Debian 11 (bullseye), but with
CONFIG_STACKPROTECTOR=n
makes it boot. So, I suspected that our bootloader does not set up the stack correctly, or similar.I sent an email to Sebastian Plotz, the author of the Minimal Linux Bootloader, to ask if he knew about any issues with his bootloader, or if stack protection seems like a likely issue with his bootloader to him.
To my surprise (it has been over 10 years since he published the bootloader!) he actually replied: He hadn’t received any problem reports regarding his bootloader, but didn’t really understand how stack protection would be related.
Debugging with QEMU
At this point, we have isolated at least one trigger for the problem, and exhausted the easy techniques of upgrading/downgrading surrounding software versions and asking upstream.
It’s time for a Tooling Level Up! Without a debugger you can only poke into the dark, which takes time and doesn’t result in thorough explanations. Particularly in this case, I think it is very likely that any source modifications could have introduced subtle issues. So let’s reach for a debugger!
Luckily, QEMU comes with built-in support for the GDB debugger. Just add the
-s -S
flags to your QEMU command to make QEMU stop execution (-s
) and set up a GDB stub (-S
) listening onlocalhost:1234
.If you wanted to debug the Linux kernel, you could connect GDB to QEMU right away, but for debugging a boot loader we need an extra step, because the boot loader runs in Real Mode, but QEMU’s GDB integration rightfully defaults to the more modern Protected Mode.
When GDB is not configured correctly, it decodes addresses and registers with the wrong size, which throws off the entire disassembly — compare GDB’s output with our assembly source:
So we need to ensure we use
qemu-system-i386
(qemu-system-x86_64
printsRemote 'g' packet reply is too long
) and configure the GDB target architecture to 16-bit 8086:Unfortunately, the above doesn’t actually work in QEMU 2.9 and newer: https://gitlab.com/qemu-project/qemu/-/issues/141.
On the web, people are working around this bug by using a modified
target.xml
file. I tried this, but must have made a mistake — I thought modifyingtarget.xml
didn’t help, but when I wrote this article, I found that it does actually seem to work. Maybe I didn’t useqemu-system-i386
but thex86_64
variant or something like that.Using an older QEMU
It is typically an exercise in frustration to get older software to compile in newer environments.
It’s much easier to use an older environment to run old software.
By querying packages.debian.org, we can see the QEMU versions included in current and previous Debian versions.
Unfortunately, the oldest listed version (QEMU 3.1 in Debian 10 (buster)) isn’t old enough. By querying snapshot.debian.org, we can see that Debian 9 (stretch) contained QEMU 2.8.
So let’s run Debian 9. The easiest way I know is to use Docker:
Unfortunately, the debian:stretch Docker container does not work out of the box anymore, because its /etc/apt/sources.list points to the deb.debian.org CDN, which only serves current versions and no longer serves stretch.
So we need to update the sources.list file to point to archive.debian.org. To correctly install QEMU you need both entries, the debian line and the debian-security line, because the Docker container has packages from debian-security installed and gets confused when these are missing from the package list:
Now we can just install QEMU as usual and start it to debug our boot process:
Now let’s start GDB and set a breakpoint on address 0x7c00, which is the address to which the BIOS loads the MBR code before it starts execution:
Debug symbols
Okay, so we have GDB attached to QEMU and can step through assembly instructions. Let’s start debugging!?
Not so fast. There is another Tooling Level Up we need first: debug symbols. Yes, even for a Minimal Linux Bootloader, which doesn’t use any libraries or local variables. Having proper names for functions, as well as line numbers, will be hugely helpful in just a second.
Before debug symbols, I would directly build the bootloader using nasm bootloader.asm, but to end up with a symbol file for GDB, we need to instruct nasm to generate an ELF file with debug symbols, then use ld to link it and finally use objcopy to copy the code out of the ELF file again.
After commit d29c615 in gokrazy/internal/mbr, I have bootloader.elf.
Back in GDB, we can load the symbols using the symbol-file command:
Automation with .gdbinit
At this point, we need 4 commands each time we start GDB. We can automate these by writing them to a .gdbinit file:
Understanding program flow
The easiest way to understand program flow seems to be to step through the program.
But Minimal Linux Bootloader (MLB) contains loops that run through thousands of iterations. You can’t use gdb’s stepi command with that.
Because MLB only contains a few functions, I eventually realized that placing a breakpoint on each function would be the quickest way to understand the high-level program flow:
With the working kernel, we get the following transcript:
With the non-booting kernel, we get:
Okay! Now we see that the bootloader starts loading the kernel from disk into RAM, but doesn’t actually get far enough to call run_kernel, meaning the problem isn’t with stack protection, with loading a working command line or with anything inside the Linux kernel.
This lets us rule out a large part of the problem space. We now know that we can focus entirely on the bootloader and why it cannot load the Linux kernel into memory.
Let’s take a closer look…
Wait, this isn’t GDB!
In the example above, using breakpoints was sufficient to narrow down the problem.
You might think we used GDB, and it looked like this:
But that’s not GDB! It’s an easy mistake to make. After all, GDB starts up with just a text prompt, and as you can see from the example above, we can just enter text and achieve a good result.
To see the real GDB, you need to start it up fully, meaning including its user interface.
You can either use GDB’s text user interface (TUI), or a graphical user interface for gdb, such as the one available in Emacs.
The GDB text-mode user interface (TUI)
You’re already familiar with the architecture, target and breakpoint commands from above. To also set up the text-mode user interface, we run a few layout commands:
The layout split command loads the text-mode user interface and splits the screen into a register window, disassembly window and command window.
With layout src we disregard the disassembly window in favor of a source listing window. Both are in assembly language in our case, but the source listing contains comments as well.
The layout src command also got rid of the register window, which we’ll get back using layout regs. I’m not sure if there’s an easier way.
The result looks like this:
The source window will highlight the next line of code that will be executed. On the left, the B+ marker indicates an enabled breakpoint, which will become helpful with multiple breakpoints. Whenever a register value changes, the register and its new value will be highlighted.
The up and down arrow keys scroll the source window.
Use C-x o to switch between the windows.
If you’re familiar with Emacs, you’ll recognize the keyboard shortcut. But as an Emacs user, you might prefer the GDB Emacs user interface:
The GDB Emacs user interface (M-x gdb)
This is M-x gdb with gdb-many-windows enabled:
Debugging the failing loop
Let’s take a look at the loop that we know the bootloader is entering, but not leaving (neither read_protected_mode_kernel_2 nor run_kernel is ever called):
The comments explain that the code loads chunks of FE00h == 65024 (127*512) bytes at a time.
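To make the chunk arithmetic concrete, here is a small Python sketch. It is my own illustration, not part of the bootloader: the 512-byte sector size and the 127-sector chunk come from the comment mentioned above, while chunks_needed is a hypothetical helper and the 10 MiB kernel size is a made-up example value. It also assumes that every loop iteration transfers one full chunk.

```python
SECTOR_SIZE = 512          # bytes per disk sector
SECTORS_PER_CHUNK = 127    # sectors transferred per loop iteration
CHUNK_SIZE = SECTORS_PER_CHUNK * SECTOR_SIZE


def chunks_needed(kernel_bytes: int) -> int:
    """Number of read + move iterations needed for kernel_bytes, rounded up."""
    return -(-kernel_bytes // CHUNK_SIZE)  # ceiling division


print(hex(CHUNK_SIZE), CHUNK_SIZE)      # 0xfe00 65024
print(chunks_needed(10 * 1024 * 1024))  # hypothetical 10 MiB kernel
```

Even a modest kernel therefore takes well over a hundred iterations, which is why single-stepping through this loop is not practical.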
Loading means calling read_from_hdd, then do_move. Let’s take a look at do_move:
int 0x15 is a call to the BIOS Service Interrupt, which will dispatch the call based on AH == 87H to the Move Memory Block (techhelpmanual.com) function.
This function moves the specified amount of memory (65024 bytes in our case) from a source address to a destination address, both of which are specified in a Global Descriptor Table (GDT) record.
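As a rough illustration of what one such descriptor looks like, the following Python sketch decodes a 286-style 8-byte segment descriptor: a 16-bit limit, a 24-bit base address in bytes 2..4, an access-rights byte and two reserved bytes. The function name decode_descriptor_286 is hypothetical, and the sample bytes (limit 0xffff, access rights 0x93) are assumptions chosen for illustration, not values copied from the bootloader.

```python
def decode_descriptor_286(desc: bytes) -> dict:
    """Decode an 8-byte, 286-style segment descriptor."""
    if len(desc) != 8:
        raise ValueError("a segment descriptor is exactly 8 bytes long")
    return {
        "limit": int.from_bytes(desc[0:2], "little"),  # bytes 0..1
        "base": int.from_bytes(desc[2:5], "little"),   # bytes 2..4, only 24 bits
        "access": desc[5],                             # byte 5
        "reserved": desc[6:8],                         # bytes 6..7
    }


# Hypothetical descriptor: limit 0xffff, base 0x100000, access rights 0x93
sample = bytes([0xFF, 0xFF, 0x00, 0x00, 0x10, 0x93, 0x00, 0x00])
d = decode_descriptor_286(sample)
print(hex(d["base"]))  # 0x100000
```

Because the base is stored little endian, the three address bytes 00 00 10 in memory read as 0x100000.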
We can use GDB to show the addresses of each of do_move’s memory move calls by telling it to stop at line 182 (the int 0x15 instruction) and print the GDT record’s destination descriptor:
The destination address is stored in bytes 2..4. Remember to read these little endian entries “back to front”.
Address #1 is 0x100000.
Address #2 is 0x10fe00.
If we press Return long enough, we eventually end up here:
Now that execution has left the bootloader, let’s take a look at the last do_move call parameters: We notice that the destination address overflowed its 24 bit data type:
0xff1e00
0x001c00
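The wraparound can be reproduced with plain arithmetic. This Python sketch is my own model, not the bootloader’s code: it assumes the destination starts at 0x100000 (address #1 above), advances by one 0xfe00-byte chunk per do_move call, and keeps only the low 24 bits, as a 3-byte descriptor field would. The function names are hypothetical.

```python
LOAD_ADDR = 0x100000   # first destination address seen in GDB
CHUNK_SIZE = 0xFE00    # 127 sectors of 512 bytes
ADDR_BITS = 24         # width of the descriptor's base address field


def destination(chunk_index: int, bits: int = ADDR_BITS) -> int:
    """Destination address of a chunk after truncation to `bits` bits."""
    return (LOAD_ADDR + chunk_index * CHUNK_SIZE) & ((1 << bits) - 1)


def first_wrapped_chunk() -> int:
    """Index of the first chunk whose untruncated address needs more than 24 bits."""
    i = 0
    while LOAD_ADDR + i * CHUNK_SIZE < (1 << ADDR_BITS):
        i += 1
    return i


i = first_wrapped_chunk()
print(i, hex(destination(i - 1)), hex(destination(i)))  # 242 0xff1e00 0x1c00
```

In this model, the chunk with index 241 is the last one with a valid destination (0xff1e00), and the next one wraps around to 0x001c00, which matches the two addresses observed in GDB.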
Root cause
At this point I reached out to Sebastian again to ask him if there was an (undocumented) fundamental architectural limit to his Minimal Linux Bootloader — with 24 bit addresses, you can address at most 16 MB of memory.
He replied explaining that he didn’t know of this limit either! He then linked to Move Memory Block (techhelpmanual.com) as proof for the 24 bit limit.
Speculation
So, is it impossible to load larger kernels into memory from Real Mode? I’m not sure.
The current bootloader code prepares a GDT in which addresses are 24 bits long at most. But note that the techhelpmanual.com documentation that Sebastian referenced is apparently for the Intel 286 (a 16 bit CPU), and some of the GDT bytes are declared reserved.
Today’s CPUs are Intel 386-compatible (a 32 bit CPU), which seems to use one of the formerly reserved bytes to represent bits 24..31 of the address, meaning we might be able to pass 32 bit addresses to BIOS functions in a GDT after all!
I wasn’t able to find clear authoritative documentation on the Move Memory Block API on 386+, or whether BIOS functions in general are just expected to work with 32 bit addresses.
But Microsoft’s 1989 HIMEM.SYS source contains a struct that documents this 32-bit descriptor usage. A more modern reference is this Operating Systems Class from FAU 2023 (page 71/72).
Hence I’m thinking that most BIOS implementations should actually support 32 bit addresses for their Move Memory Block implementation — provided you fill the descriptor accordingly.
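To illustrate what filling the descriptor accordingly could look like, here is a Python sketch that packs a 386-style 8-byte descriptor in which byte 7 carries address bits 24..31. This reflects my reading of the 386 segment descriptor layout, not tested BIOS behavior. The helper names are hypothetical, and the limit 0xffff and access byte 0x93 are illustrative assumptions.

```python
import struct


def pack_descriptor_386(base: int, limit: int = 0xFFFF, access: int = 0x93) -> bytes:
    """Pack a 386-style segment descriptor with a 32-bit base and 16-bit limit."""
    if not 0 <= base <= 0xFFFFFFFF:
        raise ValueError("base must fit into 32 bits")
    if not 0 <= limit <= 0xFFFF:
        raise ValueError("this sketch only handles 16-bit limits")
    return struct.pack(
        "<HHBBBB",
        limit,                # bytes 0..1: limit bits 0..15
        base & 0xFFFF,        # bytes 2..3: base bits 0..15
        (base >> 16) & 0xFF,  # byte 4: base bits 16..23
        access,               # byte 5: access rights
        0,                    # byte 6: flags and limit bits 16..19 (left zero)
        (base >> 24) & 0xFF,  # byte 7: base bits 24..31 (reserved on the 286)
    )


def unpack_base_386(desc: bytes) -> int:
    """Recover the 32-bit base address from a packed descriptor."""
    return int.from_bytes(desc[2:5], "little") | (desc[7] << 24)


# 0x1001c00 is the untruncated address that wrapped to 0x001c00 earlier.
desc = pack_descriptor_386(0x1001C00)
print(desc.hex(), hex(unpack_base_386(desc)))  # ffff001c00930001 0x1001c00
```

With this layout, the address that previously wrapped to 0x001c00 keeps its top bits in byte 7 instead of losing them.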
If that doesn’t work out, there’s also “Unreal Mode”, which allows using up to 4 GB in Real Mode, but is a change that is a lot more complicated. See also Julio Merino’s “Beyond the 1 MB barrier in DOS” post to get an idea of the amount of code needed.
Update: a fix!
Lobsters reader abbeyj pointed out that the following code change should fix the truncation and result in a GDT with all address bits in the right place:
…and indeed, in my first test this seems to fix the problem! It’ll take me a little while to clean this up and submit it. You can follow gokrazy issue #248 if you’re interested.
Bonus: reading BIOS source
There are actually a couple of BIOS implementations that we can look into to get a better understanding of how Move Memory Block works.
We can look at DOSBox, an open source DOS emulator. Its Move Memory Block implementation does seem to support 32 bit addresses:
Another implementation is SeaBIOS. Contrary to DOSBox, SeaBIOS is not just used in emulation: The PC Engines apu uses coreboot with SeaBIOS. QEMU also uses SeaBIOS.
The SeaBIOS handle_1587 source code is a little harder to follow, because it requires knowledge of Real Mode assembly. The way I read it, SeaBIOS doesn’t truncate or otherwise modify the descriptors and just passes them to the CPU. On 386 or newer, 32 bit addresses should work.
Mitigation
While it’s great to understand the limitation we’re running into, I wanted to unblock the pull request as quickly as possible, so I needed a quick mitigation instead of investigating if my speculation can be developed into a proper fix.
When I started router7, we didn’t support loadable kernel modules, so everything had to be compiled into the kernel. We now do support loadable kernel modules, so I could have moved functionality into modules.
Instead, I found an even easier quick fix: switching from gzip to zstd compression. This saved about 1.8 MB and will buy us some time to implement a proper fix while unblocking automated new Linux kernel version merges.
Conclusion
I wanted to share this debugging story because it shows a couple of interesting lessons:
Being able to run older versions of various parts of your software stack is a very valuable debugging tool. It helped us isolate a trigger for the bug (using an older GCC) and it helped us set up a debugging environment (using an older QEMU).
Setting up a debugger can be annoying (symbol files, learning the UI) but it’s so worth it.
Be on the lookout for wrong turns during debugging. Write down every conclusion and challenge it.
The BIOS can seem mysterious and “too low level” but there are many blog posts, lectures and tutorials. You can also just read open-source BIOS code to understand it much better.
Enjoy poking at your BIOS!
Appendix: Resources
I found the following resources helpful:
via Michael Stapelberg
February 14, 2024 at 07:45PM