Tonight I couldn’t sleep. What do I do when I cannot sleep? I debug!
As yesterday libvorbis was enough of an headache for me (ended up that the parameters are being reset by libvorbis itself because the third header has an error in parsing… now to find where the error is, that’s a good question), I decided to go with something easier, like Gentoo/FreeBSD/SPARC64 kernel debugging. No I’m not kidding, debugging the problem in the kernel is resulting easier and funnier than debugging an userland library… to decode audio files… that’s parsing a damn header!
Anyway, thanks to Javier (I won’t mistype his nick this time), I got into FreeBSD’s kernel debugging by building a kernel with -g and DDB support. Then, I easily got the trace of the kernel panic:
Tracing pid 1258 tid 100054 td 0xfffff80007aa7c80
panic() at panic+0xcc
trap() at trap+0x38c
-- fast data access mmu miss tar=0xc0b70000 %o7=0xc01ae8ec --
malloc_type_zone_allocated() at malloc_type_zone_allocated+0x14
malloc() at malloc+0x7c
hashinit() at hashinit+0x4c
nullfs_init() at nullfs_init+0x1c
vfs_modevent() at vfs_modevent+0x244
module_register_init() at module_register_init+0x58
linker_load_module() at linker_load_module+0x844
kldload() at kldload+0xf4
syscall() at syscall+0x334
-- syscall (304, FreeBSD ELF64, kldload) %o7=0x100a1c --
userland() at 0x40421288
user trace: trap %o7=0x100a1c
pc 0x40421288, sp 0x7fdffffde31
pc 0x10080c, sp 0x7fdffffdef1
pc 0x4020ab74, sp 0x7fdffffdfb1
this was tracing a kldload of nullfs, but any kldload produces errors, although they seem to be different from module to module.
db> x/ia malloc_type_zone_allocated,16
malloc_type_zone_allocated: save %sp, -0xc0, %sp
malloc_type_zone_allocated+0x4: call critical_enter
malloc_type_zone_allocated+0x8: nop
malloc_type_zone_allocated+0xc: lduw [%g7 + 0x3c], %g1
malloc_type_zone_allocated+0x10: sllx %g1, 6, %g3
------
malloc_type_zone_allocated+0x14: ldx [%i0 + 0x40], %g2
------
malloc_type_zone_allocated+0x18: brz,pt %i1, malloc_type_zone_allocated+0x38
malloc_type_zone_allocated+0x1c: add %g3, %g2, %g4
malloc_type_zone_allocated+0x20: ldx [%g3 + %g2], %g1
malloc_type_zone_allocated+0x24: add %g1, %i1, %g1
malloc_type_zone_allocated+0x28: stx %g1, [%g3 + %g2]
malloc_type_zone_allocated+0x2c: ldx [%g4 + 0x10], %g1
malloc_type_zone_allocated+0x30: add %g1, 0x1, %g1
malloc_type_zone_allocated+0x34: stx %g1, [%g4 + 0x10]
malloc_type_zone_allocated+0x38: subcc %i2, -0x1, %g0
malloc_type_zone_allocated+0x3c: be,pn malloc_type_zone_allocated+0x58
malloc_type_zone_allocated+0x40: or %g0, 0x1, %g2
malloc_type_zone_allocated+0x44: sll %g2, %i2, %g2
malloc_type_zone_allocated+0x48: sra %g2, 0x0, %g2
malloc_type_zone_allocated+0x4c: ldx [%g4 + 0x20], %g1
malloc_type_zone_allocated+0x50: or %g1, %g2, %g1
malloc_type_zone_allocated+0x54: stx %g1, [%g4 + 0x20]
malloc_type_zone_allocated+0x58:
This is the disassembly of the function that died, I’ve artificially separated the point where the crash happens from the rest of the code.
First of all, the thing that scared me was that even if I know nothing of SPARC assembler, and even my Intel assembler is pretty much limited to 8086 instructions (although I still remember most of them clearly, as I wrote an 8086 emulator when I was in high school), I was able to correlate more or less that code with
struct malloc_type_internal *mtip;
struct malloc_type_stats *mtsp;
critical_enter();
mtip = mtp->ks_handle;
mtsp = &mtip->mti_stats[curcpu];
if (size > 0) {
mtsp->mts_memalloced += size;
mtsp->mts_numallocs++;
}
if (zindx != -1)
mtsp->mts_size |= 1 << zindx;
critical_exit();
that is the code of the function in C source.
Now my problem is to find what causes the “fast mmu miss”, or in general the panic. The registers are funny:
db> show reg
g0 0xffffffffffffffff
g1 0xc04ce800 log_cdevsw+0x48
g2 0xffffffffffffffff
g3 0x870ad
g4 0xfffff8000040fff8
g5 0x1dfd fpu_fault_size+0x1c49
g6 0xcb591980
g7 0xc054d7b0 pcpu0+0x1a90
i0 0x12 pcpup+0xb
i1 0xc04816e0
i2 0xcb590ab8
i3 0xa pcpup+0x3
i4 0xcb590b70
i5 0x1
i6 0xcb590221
i7 0xc01d7bac kdb_enter+0x34
tnpc 0xc01d7bb8 kdb_enter+0x40
tpc 0xc01d7bb4 kdb_enter+0x3c
tstate 0x441d001601
kdb_enter+0x3c: ta %xcc, 1
as i0 is set to a quite low value (0x12) and the debugger tells me it’s referred to pcpup (“pcpu pointer”) address plus a value… the problem is that pcpup is .. uh.. loaded in g7:
#define PCPU_REG %g7
register struct pcpu *pcpup __asm__(__XSTRING(PCPU_REG));
I wonder if it’s a miscompile or simply ddb going crazy; still if I understood the ldx operator well enough, it’s trying to load data from g2 (that’s –1) into the address 0x42 … it does not feel too much right.
Anyway, will try to debug this further when I can find someone in Gentoo/SPARC team who can help me understanding SPARC assembler.