Preface
As a home server I run an Orange Pi PC, which is powerful enough for my purposes. It runs some basic services such as a DNS server, a DHCP server, Home Assistant and a VPN, so I want to keep downtime to a minimum. I am planning to reinstall it with a fresh Debian Bullseye system using a Debian kernel (I am currently running a custom kernel), so I am emulating the whole system in QEMU to iron out all configuration errors before deploying it to the Orange Pi itself.
While testing I used the -kernel parameter of QEMU to boot the Debian kernel, and everything was fine. But then I tried to run the system in QEMU via U-Boot (as you would on the device itself), so I prepared a virtual SD card with U-Boot, the kernel and the filesystem. All of a sudden I couldn’t get the kernel to boot: U-Boot just hung on “Starting kernel…”.
Every other kernel worked just fine via U-Boot. I was able to boot my old kernel, the Arch Linux ARM kernel and the Armbian kernel. Why wouldn’t this Debian kernel boot via U-Boot in QEMU?
Debugging
After days of rebuilding and debugging U-Boot with several options I had almost given up. Then I realized I could at least connect the GNU debugger (gdb) to see if QEMU was actually doing anything useful.
I started U-Boot in QEMU and connected gdb. I loaded my DTB and kernel inside U-Boot and set a breakpoint in gdb at the location of the kernel. As the kernel is self-extracting, the load address is also the entrypoint: the first instruction to execute after U-Boot hands control over to the kernel.
The first instruction I encountered was:
tstne r0, #315392 ; 0x4d000
I already knew that every other kernel had NOPs as its first instructions, which I learned is for legacy reasons. A quick search taught me that this seemingly strange instruction also acts as a NOP, while at the same time making the image a valid PE/COFF binary for EFI.
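To make the PE/COFF remark a bit more concrete, here is a small Python sketch. The word 0x13105A4D is my assumption for the encoding behind the tstne disassembly above (gdb only showed me the mnemonic). Stored little-endian, its first two bytes should spell the “MZ” magic that PE/COFF expects at offset 0, and its top four bits hold the condition code.

```python
import struct

# Assumed encoding of: tstne r0, #0x4d000
FIRST_WORD = 0x13105A4D

# ARM instruction words are stored little-endian in memory,
# so the lowest byte of the word ends up at offset 0 of the image.
raw = struct.pack("<I", FIRST_WORD)
magic = raw[:2]

# The top four bits of every classic ARM instruction are its condition code.
COND_NAMES = ["eq", "ne", "cs", "cc", "mi", "pl", "vs", "vc",
              "hi", "ls", "ge", "lt", "gt", "le", "al", "nv"]
cond = COND_NAMES[FIRST_WORD >> 28]

print(magic)  # b'MZ'
print(cond)   # ne
```

So the same four bytes are both a conditional instruction for the CPU and a file signature for an EFI loader.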
However, as soon as I stepped to the next instruction with stepi, I noticed the program counter was 0x4 while the kernel was loaded at 0x42000000. This shouldn’t happen, because the tstne instruction doesn’t directly control the flow of the program. The vector table tells us that address 0x4 is where the CPU jumps when it encounters an undefined instruction. Continuing my search in this direction, I stumbled upon this patch by Andre Przywara. The linked discussion on the qemu-devel mailing list explains why this is happening.
It boils down to the fact that the instruction is actually invalid according to the ARM specification, yet real devices execute it without triggering this exception. As QEMU sticks to the specification, raising the exception is in fact legitimate behaviour.
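As far as I understand the ARM encoding, the problem sits in bits 15:12 of the instruction word. For a tst these bits are marked “should be zero”, but here they are needed to form the “Z” of the magic. The sketch below splits the word into its fields. Both the word 0x13105A4D and the helper name decode_dp_immediate are my own assumptions for illustration, not something taken from the patch.

```python
WORD = 0x13105A4D  # assumed encoding of: tstne r0, #0x4d000


def decode_dp_immediate(word):
    """Split an ARM data-processing (immediate) word into its fields."""
    imm8 = word & 0xFF
    rot = ((word >> 8) & 0xF) * 2  # the immediate is imm8 rotated right by 2*rot4
    imm32 = ((imm8 >> rot) | (imm8 << (32 - rot))) & 0xFFFFFFFF
    return {
        "cond": (word >> 28) & 0xF,    # 0x1 means "ne"
        "opcode": (word >> 21) & 0xF,  # 0x8 means "tst"
        "s": (word >> 20) & 1,
        "rn": (word >> 16) & 0xF,
        "rd": (word >> 12) & 0xF,      # should be zero for tst
        "imm": imm32,
    }


fields = decode_dp_immediate(WORD)
print(hex(fields["imm"]))  # 0x4d000
print(fields["rd"])        # 5, so the "should be zero" field is not zero
```

Real hardware apparently ignores the stray bits, while QEMU treats the word as an undefined instruction.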
It can take a long time before patches actually make it into the kernel, and even longer before Debian starts shipping a kernel which includes them. So for now we can make QEMU bypass these instructions by setting the Z bit to 1, as suggested in the patch: with Z set, the ne condition fails and the instruction is skipped instead of executed. This is easy to do in the debugger, but for the time being I want to script it in U-Boot. I am no assembly expert, but the quickest way I can think of to do this is with these two instructions:
mov r0,#0 (0xE3A00000)
cmp r0,#0 (0xE3500000)
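The hex words next to the two instructions can be double-checked by assembling them by hand. This is a minimal sketch assuming the standard ARM data-processing (immediate) layout; encode_dp_immediate is just an illustrative helper of mine, not part of any toolchain.

```python
def encode_dp_immediate(opcode, s, rn, rd, imm8, cond=0xE):
    """Assemble an ARM data-processing instruction with an 8-bit immediate.

    cond defaults to 0xE ("always"); bit 25 selects the immediate form.
    """
    return ((cond << 28) | (1 << 25) | (opcode << 21) | (s << 20)
            | (rn << 16) | (rd << 12) | imm8)


OPCODE_CMP = 0xA
OPCODE_MOV = 0xD

mov_r0_0 = encode_dp_immediate(OPCODE_MOV, s=0, rn=0, rd=0, imm8=0)
cmp_r0_0 = encode_dp_immediate(OPCODE_CMP, s=1, rn=0, rd=0, imm8=0)

print(f"0x{mov_r0_0:08X}")  # 0xE3A00000
print(f"0x{cmp_r0_0:08X}")  # 0xE3500000
```

Comparing r0 with 0 right after loading 0 into it always yields “equal”, which is what sets the Z bit.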
After loading the DTB and kernel, I overwrite the first two instructions of the kernel (they are those strange NOPs anyway) with these instructions in U-Boot. Here I assume the kernel is loaded at the address in ${kernel_addr_r}:
mw.l ${kernel_addr_r} 0xE3A00000
setexpr kernel_addr_r_2 ${kernel_addr_r} + 0x4
mw.l ${kernel_addr_r_2} 0xE3500000
If I now start the kernel with the bootz command, the kernel actually boots inside QEMU via U-Boot. These commands can easily be scripted in a boot.cmd file, but shouldn’t be necessary anymore once deployed to a real device.
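To convince myself why overwriting only the first two words is enough, I find a toy model helpful. The sketch below is not an emulator. It assumes the image starts with seven copies of the tstne word (both the count and the encoding 0x13105A4D are my assumptions), that the Z flag is clear when U-Boot jumps to the kernel, and that QEMU only raises the exception when the condition passes and the instruction is really executed.

```python
import struct

TSTNE = 0x13105A4D     # assumed encoding of the EFI-friendly "NOP"
MOV_R0_0 = 0xE3A00000  # mov r0, #0
CMP_R0_0 = 0xE3500000  # cmp r0, #0


def patch(image):
    """Host-side equivalent of the two mw.l commands: overwrite words 0 and 1."""
    struct.pack_into("<II", image, 0, MOV_R0_0, CMP_R0_0)


def run(image, z_flag=0):
    """Walk the header words; return 'undef' if a tstne is really executed."""
    r0 = 0
    z = z_flag
    for (word,) in struct.iter_unpack("<I", image):
        if word == MOV_R0_0:
            r0 = 0
        elif word == CMP_R0_0:
            z = 1 if r0 == 0 else 0
        elif word == TSTNE and z == 0:
            # Condition "ne" passes, so the invalid instruction is executed.
            return "undef"
        # With z == 1 the "ne" condition fails and the word is skipped.
    return "ok"


image = bytearray(struct.pack("<7I", *[TSTNE] * 7))
before = run(image)
patch(image)
after = run(image)
print(before, after)  # undef ok
```

Once the cmp has set Z, every remaining tstne fails its condition, so none of them needs to be patched individually.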
Conclusion
The main reason I wrote this post is that I am amazed that this code has been in the Linux kernel for at least 4 years, yet it was only 17 days ago at the time of writing that Adam Lackorzynski posted about it on the mailing list. I am not sure whether this is a coincidence or whether something changed recently in QEMU or in the way Debian builds its kernel. Still, I can’t imagine I am the only one stumbling across this. And although it was very educational, maybe this post will save someone else from spending days figuring out what the problem is…