Trouble with memtest

So there is something that doesn’t feel quite right on the tinderbox: a few logs show ICEs (internal compiler errors) and a few killed Ruby processes that make no sense at all. This is usually an indication of bad memory. Now it is true that I ran a one-pass test of the memory when I got the system, and it didn’t spot anything, and that this does not happen consistently, so I wanted to give it a 24-hour memory test. It should be easy thanks to a serial console and the memtest86+ software, no?

Well, no. Let’s start with a bit of background on the setup. Excelsior, the server that is running the two tinderbox instances, is a Supermicro barebone, which integrates an IPMI 2.0 SBMC. This allows me to control it just fine from here (Italy) while the server is back in Los Angeles. At least in theory: while the server is on a public IP, the IPMI interface is connected to the VPN of my employer back in the US, so to actually connect to it I have to jump through an SSH host, which is easily done on Linux but not on Windows.

The serial console, by the way, is tremendously easy to reach, as you can simply SSH into the IPMI IP and use some semi-standard commands to get to it; in particular you just need to run cd /system1/sol1 and then start. Unfortunately, my first blind setup of grub (2) and the Linux kernel was wrong, as I set them to output to ttyS0, while the IPMI forwards ttyS1. And finding out how to set up grub2 to use a serial console wasn’t easy.
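As a rough sketch of the jump-through, something like the following could work on a Linux box. The host names, address and user names are placeholders rather than the real ones, the ipmi_ssh helper is hypothetical, and the -J (jump host) option assumes a reasonably recent OpenSSH client.

```shell
# Sketch only: jump.example.org, 10.0.0.5 and both user names are placeholders.
# SSH can be overridden (e.g. SSH=echo) to print the arguments instead of connecting.
ipmi_ssh() {
    jump="$1"     # user@host of the SSH jump host
    target="$2"   # user@address of the IPMI interface
    "${SSH:-ssh}" -J "$jump" "$target"
}

# Example call:
#   ipmi_ssh me@jump.example.org ADMIN@10.0.0.5
# Then, at the IPMI prompt, reach the serial-over-LAN console with:
#   cd /system1/sol1
#   start
```

On older OpenSSH clients without -J, a ProxyCommand entry in ~/.ssh/config should achieve the same thing.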

What you have to do is edit /etc/default/grub and add this:

GRUB_SERIAL_COMMAND="serial --unit=1 --speed=115200"

This will set it to use the second serial interface (--unit=1) at 115200,8n1 (8n1 is the default). And there you go. Actually the command editing seems to work more reliably on the serial console than on the display.
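For context, a slightly fuller sketch of what the relevant part of /etc/default/grub might look like. Only the GRUB_SERIAL_COMMAND line comes from the setup described here; the GRUB_TERMINAL and GRUB_CMDLINE_LINUX values are assumptions about a typical serial setup on ttyS1, so adjust them to your own system.

```shell
# /etc/default/grub (fragment)

# Assumption: show the menu on both the serial line and the local display.
GRUB_TERMINAL="serial console"

# From the text: second serial port (unit 1) at 115200 baud, 8n1 by default.
GRUB_SERIAL_COMMAND="serial --unit=1 --speed=115200"

# Assumption: also point the Linux kernel console at ttyS1, which is the port
# the IPMI forwards; the last console= entry becomes /dev/console.
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS1,115200n8"

# After editing, regenerate the configuration. The command name and output
# path vary by distribution, for example:
#   grub2-mkconfig -o /boot/grub2/grub.cfg
```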

So this is done, what’s next? Well, next is getting memtest to work, and it doesn’t help that the pre-compiled binary provided by the upstream project is not able to start. The problem is a “too small lower memory” error, which is caused by a combination of grub, BIOS and the compiled file itself. While for some systems it’s enough to use a custom compiled version such as the one provided by sys-apps/memtest86+, on Excelsior that didn’t work. So I had to go with the fallback: the ebuild installs both an old-style Linux kernel bootable binary and a netbsd-style binary; as far as I can tell the latter does not support boot parameters though.

The correct way to boot the netbsd-style binary with grub2 is to edit /etc/grub.d/40_custom and add:

menuentry "Memtest86+" {
    insmod bsd
    knetbsd /boot/memtest86plus/memtest.netbsd
}

For those wondering, yes I’m working as we speak on ebuilds that install the grub2 extra configuration file by themselves, and you should have them by the end of the day. This involves both memtest86+ and, as you’ll see in a moment, memtest86 itself. This will make it much easier to deal with the packages.

Ah of course this has to be built on my laptop, because both memtest86+ and memtest86 require a multilib compiler as they are built 32-bit. Sigh.

Unfortunately, not even that is good enough for my system. While with this code it boots, instead of refusing to do so, it seems to get stuck during initialisation and the test never starts. But how do I know that, given that memtest by default does not output to serial, and when it does it outputs on serial 0?

Well, the IPMI interface actually has what they call an iKVM written in Java, not to be confused with another Java IKVM. The problem with it is that it doesn’t work with IcedTea and thus you have to use Oracle’s JRE to run it; the bug involves not only Supermicro systems, but also Dell and others. Why it hasn’t been solved on the IcedTea side is something I have no idea about.

While the package uses a standard RFB/VNC protocol, it implements authentication in a non-standard way as far as I can tell, so I can’t simply log in as I’d like to do. It also probably either has some extensions or an extra signalling protocol, as it can be used to set up “virtual media” such as virtual CD-ROMs and virtual floppy images.

Now, this latter detail should give me enough to deal with the memtest issue, as I’d just have to connect a virtual ISO of memtest to get it to work but … Java segfaults (it uses a native library) the moment I try to do so! I have yet to check whether this is simply because it’s trying to use a signalling port that is unavailable, but it doesn’t feel very likely.

Okay so memtest86+ boots as a NetBSD-style kernel, but it doesn’t seem like it’s able to do anything. What about the original project? Memtest86 is still alive and kicking, and released a new version last October (called 4.0b but versioned 4.0s) which supports “up to 32 cores and 8 TB of memory”. Reading such release notes I’m afraid that the reason why Memtest86+ doesn’t work is simply that Excelsior has too much memory, or too many cores, for it (32 cores, 64GB of memory).

Unfortunately Memtest86 doesn’t seem to build a NetBSD kernel style binary, so I can’t boot that either. Which means I’m stuck.

Interestingly, Memtest86+ released a 5.00beta back in May — unfortunately there are two big problems: there is no download for the NetBSD kernel, and most worrisome, they didn’t release the sources. Given the project is GPL-2 and includes code from the original Memtest86 project (which the maintainer of Memtest86+ has no way to claim rights to), this is a license violation.

So now I’m using the user-space memtester, which seems to work fine even on hardened kernels, with the caveat that it doesn’t allow testing the full range of memory but just a little piece of it at a time. Sigh. No easy way out, eh?
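To give an idea of what running it in pieces could look like, here is a minimal sketch of a wrapper around memtester. The run_memtester function and the MEMTESTER override are hypothetical names made up for this example; the only thing taken from memtester itself is its basic invocation, a memory size followed by an optional loop count. Keep in mind that nothing guarantees each run lands on different physical pages, so this is no substitute for a proper boot-time test.

```shell
# Sketch only: run memtester several times over a fixed-size chunk of memory.
# memtester's basic usage is: memtester <mem>[B|K|M|G] [loops]
# MEMTESTER can be overridden to point at another binary (hypothetical knob).
run_memtester() {
    size="$1"    # chunk size per run, e.g. 4G
    loops="$2"   # test loops per invocation
    runs="$3"    # number of invocations
    i=1
    while [ "$i" -le "$runs" ]; do
        echo "run $i/$runs: memtester $size $loops"
        # Stop at the first failing run so the error is not scrolled away.
        "${MEMTESTER:-memtester}" "$size" "$loops" || return 1
        i=$((i + 1))
    done
}
```

For example, run_memtester 4G 1 6 would start six consecutive single-loop runs over 4GB each. It is best run as root so that memtester can lock the memory it is testing.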