Tonight I couldn’t sleep. What do I do when I cannot sleep? I debug!

As yesterday libvorbis was enough of an headache for me (ended up that the parameters are being reset by libvorbis itself because the third header has an error in parsing… now to find where the error is, that’s a good question), I decided to go with something easier, like Gentoo/FreeBSD/SPARC64 kernel debugging. No I’m not kidding, debugging the problem in the kernel is resulting easier and funnier than debugging an userland library… to decode audio files… that’s parsing a damn header!

Anyway, thanks to Javier (I won’t mistype his nick this time), I got into FreeBSD’s kernel debugging by building a kernel with -g and DDB support. Then, I easily got the trace of the kernel panic:

Tracing pid 1258 tid 100054 td 0xfffff80007aa7c80
panic() at panic+0xcc
trap() at trap+0x38c
-- fast data access mmu miss tar=0xc0b70000 %o7=0xc01ae8ec --
malloc_type_zone_allocated() at malloc_type_zone_allocated+0x14
malloc() at malloc+0x7c
hashinit() at hashinit+0x4c
nullfs_init() at nullfs_init+0x1c
vfs_modevent() at vfs_modevent+0x244
module_register_init() at module_register_init+0x58
linker_load_module() at linker_load_module+0x844
kldload() at kldload+0xf4
syscall() at syscall+0x334
-- syscall (304, FreeBSD ELF64, kldload) %o7=0x100a1c --
userland() at 0x40421288
user trace: trap %o7=0x100a1c
pc 0x40421288, sp 0x7fdffffde31
pc 0x10080c, sp 0x7fdffffdef1
pc 0x4020ab74, sp 0x7fdffffdfb1

this was tracing a kldload of nullfs, but any kldload produces errors, although they seem to be different from module to module.

db> x/ia malloc_type_zone_allocated,16
malloc_type_zone_allocated:     save            %sp, -0xc0, %sp
malloc_type_zone_allocated+0x4: call            critical_enter
malloc_type_zone_allocated+0x8: nop
malloc_type_zone_allocated+0xc: lduw            [%g7 + 0x3c], %g1
malloc_type_zone_allocated+0x10:        sllx            %g1, 6, %g3
malloc_type_zone_allocated+0x14:        ldx             [%i0 + 0x40], %g2
malloc_type_zone_allocated+0x18:        brz,pt          %i1, malloc_type_zone_allocated+0x38
malloc_type_zone_allocated+0x1c:        add             %g3, %g2, %g4
malloc_type_zone_allocated+0x20:        ldx             [%g3 + %g2], %g1
malloc_type_zone_allocated+0x24:        add             %g1, %i1, %g1
malloc_type_zone_allocated+0x28:        stx             %g1, [%g3 + %g2]
malloc_type_zone_allocated+0x2c:        ldx             [%g4 + 0x10], %g1
malloc_type_zone_allocated+0x30:        add             %g1, 0x1, %g1
malloc_type_zone_allocated+0x34:        stx             %g1, [%g4 + 0x10]
malloc_type_zone_allocated+0x38:        subcc           %i2, -0x1, %g0
malloc_type_zone_allocated+0x3c:        be,pn           malloc_type_zone_allocated+0x58
malloc_type_zone_allocated+0x40:        or              %g0, 0x1, %g2
malloc_type_zone_allocated+0x44:        sll             %g2, %i2, %g2
malloc_type_zone_allocated+0x48:        sra             %g2, 0x0, %g2
malloc_type_zone_allocated+0x4c:        ldx             [%g4 + 0x20], %g1
malloc_type_zone_allocated+0x50:        or              %g1, %g2, %g1
malloc_type_zone_allocated+0x54:        stx             %g1, [%g4 + 0x20]

This is the disassembly of the function that died, I’ve artificially separated the point where the crash happens from the rest of the code.
First of all, the thing that scared me was that even if I know nothing of SPARC assembler, and even my Intel assembler is pretty much limited to 8086 instructions (although I still remember most of them clearly, as I wrote an 8086 emulator when I was in high school), I was able to correlate more or less that code with

        struct malloc_type_internal *mtip;
        struct malloc_type_stats *mtsp;

        mtip = mtp->ks_handle;
        mtsp = &mtip->mti_stats[curcpu];
        if (size > 0) {
                mtsp->mts_memalloced += size;
        if (zindx != -1)
                mtsp->mts_size |= 1 << zindx;

that is the code of the function in C source.

Now my problem is to find what causes the “fast mmu miss”, or in general the panic. The registers are funny:

db> show reg
g0          0xffffffffffffffff
g1          0xc04ce800  log_cdevsw+0x48
g2          0xffffffffffffffff
g3             0x870ad
g4          0xfffff8000040fff8
g5              0x1dfd  fpu_fault_size+0x1c49
g6          0xcb591980
g7          0xc054d7b0  pcpu0+0x1a90
i0                0x12  pcpup+0xb
i1          0xc04816e0
i2          0xcb590ab8
i3                 0xa  pcpup+0x3
i4          0xcb590b70
i5                 0x1
i6          0xcb590221
i7          0xc01d7bac  kdb_enter+0x34
tnpc        0xc01d7bb8  kdb_enter+0x40
tpc         0xc01d7bb4  kdb_enter+0x3c
tstate      0x441d001601
kdb_enter+0x3c: ta              %xcc, 1

as i0 is set to a quite low value (0x12) and the debugger tells me it’s referred to pcpup (“pcpu pointer”) address plus a value… the problem is that pcpup is .. uh.. loaded in g7:

#define PCPU_REG %g7

register struct pcpu *pcpup __asm__(__XSTRING(PCPU_REG));

I wonder if it’s a miscompile or simply ddb going crazy; still if I understood the ldx operator well enough, it’s trying to load data from g2 (that’s –1) into the address 0x42 … it does not feel too much right.

Anyway, will try to debug this further when I can find someone in Gentoo/SPARC team who can help me understanding SPARC assembler.