So there is something that doesn’t feel quite right on the tinderbox: a few logs show ICEs (internal compiler errors) and there are a few killed Ruby processes that make no sense at all. This is usually an indication of bad memory. Now, it is true that I ran a one-pass test of the memory when I got the system and it didn’t spot anything, and that this does not happen consistently, so I wanted to give it a 24-hour memory test. It should be easy thanks to a serial console and the memtest86+ software, no?
Well, no. Let’s start with a bit of background on the setup. Excelsior, the server that is running the two tinderbox instances, is a Supermicro barebone, which integrates an IPMI 2.0 SBMC. This allows me to control it just fine from here (Italy) while the server is back in Los Angeles. At least in theory: while the server is on a public IP, the IPMI interface is connected to the VPN of my employer back in the US, so to actually connect to it I have to jump through an SSH host, which is easily done on Linux but not on Windows.
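For illustration, the jump through the SSH host could be wrapped roughly like this. It is only a sketch: the host name, the IPMI address and the `ADMIN` user are placeholders, not the real ones, and `ipmi_ssh` is a hypothetical helper name. It relies on OpenSSH’s `ProxyCommand` option together with `ssh -W`, which forwards the connection through the jump host.

```shell
# Placeholder values: substitute the real jump host and IPMI address.
JUMP_HOST="user@jump.example.com"   # hypothetical SSH host inside the VPN
IPMI_ADDR="ADMIN@192.0.2.10"        # hypothetical IPMI login and address

# Hypothetical helper: open an SSH session to the IPMI interface
# by tunnelling through the jump host.
ipmi_ssh() {
    # ssh -W host:port forwards stdin/stdout to the target via the jump host;
    # %h and %p are expanded by ssh to the target host and port.
    ssh -o ProxyCommand="ssh -W %h:%p $JUMP_HOST" "$IPMI_ADDR" "$@"
}
```

Running `ipmi_ssh` would then land on the IPMI command prompt as if the interface were directly reachable.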
The serial console, by the way, is tremendously easy to get to: you can simply SSH into the IPMI IP and use some semi-standard commands to reach it. In particular, you just need to run `cd /system1/sol1` and then `start`. Unfortunately, my first blind setup of grub (2) and the Linux kernel was wrong, as I set them to output to `ttyS0`, while the IPMI forwards `ttyS1`. And finding out how to set up grub2 to use a serial console wasn’t easy.
What you have to do is edit `/etc/default/grub` and add this:
    GRUB_TERMINAL="serial"
    GRUB_SERIAL_COMMAND="serial --unit=1 --speed=115200"
This will set it to use the second serial interface (`--unit=1`) at 115200,8n1 (8n1 is the default). And there you go. Actually the command editing seems to work more reliably on the serial console than on the display.
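The kernel needs to be pointed at the same port. A minimal sketch, assuming the stock `GRUB_CMDLINE_LINUX` variable in `/etc/default/grub` and the kernel’s standard `console=` parameter; keeping `tty0` as an additional console is an optional extra assumed here, and if the variable already has a value the new parameters should be appended rather than replacing it:

```shell
# /etc/default/grub is read as shell by grub-mkconfig.
# Kernel messages go to the local display and to the second serial port;
# the last console= entry is the one that becomes /dev/console.
# 115200n8 matches the speed given to grub's serial command.
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS1,115200n8"
```

The generated grub.cfg has to be rebuilt afterwards with `grub-mkconfig` (installed as `grub2-mkconfig` on some distributions) for the change to take effect.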
So this is done; what’s next? Well, next is getting memtest to work, and it doesn’t help that the pre-compiled binary provided by the upstream project is not able to start. The problem is a “too small lower memory” error, which is caused by a combination of grub, BIOS and the compiled file itself. While for some systems it’s enough to use a custom compiled version such as the one provided by sys-apps/memtest86+, on Excelsior that didn’t work. So I had to go with the fallback: the ebuild installs both an old-style Linux kernel bootable binary and a NetBSD-style binary; as far as I can tell the latter does not support boot parameters, though.
The correct way to use that particular method on Grub 2 is to edit `/etc/grub.d/40_custom` and add:
    menuentry "Memtest86+" {
        insmod bsd
        knetbsd /boot/memtest86plus/memtest.netbsd
    }
For those wondering, yes I’m working as we speak on ebuilds that install the grub2 extra configuration file by themselves, and you should have them by the end of the day. This involves both memtest86+ and, as you’ll see in a moment, memtest86 itself. This will make it much easier to deal with the packages.
Ah, of course this has to be built on my laptop, because both memtest86+ and memtest86 require a multilib compiler as they are built 32-bit. Sigh.
Unfortunately, not even that is good enough for my system. While with this code it boots, instead of refusing to do so, it seems to get stuck during initialisation and the test never starts. But how do I know that, given that memtest by default does not output to serial, and when it does it outputs on serial 0?
Well, the IPMI interface actually has what they call an iKVM written in Java, not to be confused with another Java IKVM. The problem with it is that it doesn’t work with IcedTea, and thus you have to use Oracle’s JRE to run it; the bug involves not only Supermicro systems, but also Dell and others. Why it hasn’t been solved on the IcedTea side is something I have no idea about.
While the package uses a standard RFB/VNC protocol, it implements authentication in a non-standard way from what I can tell, so I can’t simply log in as I’d like to do. It also probably either has some extensions or an extra signalling protocol, as it can be used to set up “virtual media” such as virtual CD-ROMs and virtual floppy images.
Now, this latter detail should give me enough to deal with the memtest issue, as I’d just have to connect a virtual ISO of memtest to get it to work but … Java segfaults (it uses a native library) the moment I try to do so! I have yet to check whether this is simply because it’s trying to use a signalling port that is unavailable, but it doesn’t feel very likely.
Okay, so memtest86+ boots as a NetBSD-style kernel, but it doesn’t seem like it’s able to do anything. What about the original project? Memtest86 is still alive and kicking, and released a new version last October (called 4.0b but versioned 4.0s) which supports “up to 32 cores and 8 TB of memory”. Reading such release notes, I’m afraid that the reason why Memtest86+ doesn’t work is simply that Excelsior has too much memory or too many cores for it (32 cores, 64GB of memory).
Unfortunately Memtest86 doesn’t seem to build a NetBSD kernel style binary, so I can’t boot that either. Which means I’m stuck.
Interestingly, Memtest86+ released a 5.00beta back in May. Unfortunately there are two big problems: there is no download for the NetBSD kernel and, more worryingly, they didn’t release the sources. Given the project is GPL-2 and includes code from the original Memtest86 project (which the maintainer of Memtest86+ has no way to claim rights to), this is a license violation.
So now I’m using the user-space memtester, which seems to work fine even on hardened kernels, with the caveat that it doesn’t allow testing the full range of memory, just a little piece of it at a time. Sigh. No easy way out, eh?
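A rough sketch of how such piecemeal runs could be scripted, using memtester’s documented `memtester <mem>[B|K|M|G] [loops]` invocation. The helper name `run_memtester_passes` and the sizes used with it are illustrative assumptions, not the exact commands used on Excelsior. Each invocation makes a fresh allocation, which may or may not land on different physical pages, so this widens coverage only by chance.

```shell
# Hypothetical helper: run memtester several times over one chunk of RAM.
# $1 = amount of memory to lock and test per run (for example 4G)
# $2 = how many separate runs to perform
run_memtester_passes() {
    chunk="$1"
    passes="$2"
    i=1
    while [ "$i" -le "$passes" ]; do
        echo "memtester run $i of $passes on $chunk"
        # One loop per invocation; stop at the first run that reports a failure.
        memtester "$chunk" 1 || return 1
        i=$((i + 1))
    done
}
```

It would be called as, say, `run_memtester_passes 4G 6`, typically as root so that the memory can actually be locked; the chunk has to be small enough to leave room for everything else running on the box.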
A few updates:
* indeed Java was crashing because it was unable to reach the port it uses for the side channel used for the virtual storage. It defaults to port 623 (privileged), but you can tweak the `jnlp` file to make it connect to a different port, which makes it possible to forward it to a non privileged port such as 863;
* the protocol it uses is not recognized by Wireshark; at first glance it seems to be some kind of USB-over-IP. Interestingly enough, it reads the ISO file and keeps it open, but it doesn’t seem to need to have it open to work, so I’m really not sure how that thing works;
* I’ll probably write a blog post explicitly detailing what I found with a very _very_ quick reverse engineering of ATEN’s iKVM. Hopefully there is already some software supporting this, or if not, there might be enough interest to get something going;
* once I was able to get the virtual storage going, I connected the Memtest86+ ISO and not only did it boot fine, it also _started_ fine. I’m now waiting for completion; it’s just taking a _very_ long time. I should have used the Memtest86 4.0s ISO, which supports multi-core.
I seem to remember getting the jnlp for the iKVM working just fine on my system with IcedTea.
Details: Ubuntu 12.04 amd64 with the following relevant packages:
* icedtea-7-jre-cacao
* icedtea-7-jre-jamvm
* icedtea-7-plugin
* icedtea-netx
* icedtea-netx-common
* openjdk-7-jre
* openjdk-7-jre-headless
* openjdk-7-jre-lib
IPMIView is a wonderful utility for Supermicro systems that avoids the JNLP problems. However, I have only tried it with the Sun/Oracle JRE, not with IcedTea.
ftp://ftp.supermicro.com/ut…
But give it a chance. It supports both the SOL console and iKVM, and has an option to reset the BMC if needed (with some firmware versions the iKVM can get stuck in a wrong state). And VirtualMedia works like a charm with it.
Just try it out.
Ah thanks! I missed that one. It’s definitely something I want to look at, and possibly package if it makes sense; too bad they don’t publish the sources.
I still wouldn’t mind an open-source application to do this. Even better if it could lead to something such as an Android or iOS application, as that would be even more interesting. I guess this might be a good place to start a new project from scratch.
P.S.: I’ve left the server to memtest the whole day and I’ll check back on it tomorrow, we’ll see how it goes.
I know this is a bit old (I had computer issues and am catching up), but I’m sure others will still read it and can hopefully use this info…
The (Linux) kernel has a memtest kernel-config option. 1-4 patterns, AFAIK. Not as thorough as memtest86(+), but given that you’re already building/running a Linux kernel on the box, that should be FAR easier to try. Of course a clean result wouldn’t mean as much as a clean memtest86, but if it came out unclean, you’d save the trouble.
Meanwhile, I had a system/memory at one point (original Opteron, so it was registered/ECC too!) that passed memtest86 just fine, but would occasionally fail under higher load than memtest86 apparently put on it during the tests. A tarball bunzip2 failure was the most common symptom, with occasional ICEs and sometimes system lockups or MCEs, which were harder to decode back then. I fought with that memory for some time, with no money for more, until a BIOS update allowed me to underclock it slightly. Once I lowered the memory clock rate slightly (original DDR, PC3200 rated, PC3000 speed is what I ended up underclocking to), I was able to recover a bit of the lost performance by tightening up the various wait-state latencies somewhat, and it was rock-stable after that.
The problem was not the memory cells, which is what memtest86 tests and which were perfectly fine, but that the capacitance on the board traces was evidently just at borderline tolerance in one direction, while the memory tolerance was evidently borderline in the other direction; thus the problem was on memory access, not on actual storage.
Eventually I upgraded the RAM (1 GB to 8 GB), and I was able to clock the new RAM normally, so it was indeed the sticks, but memtest86 came up clean as it wasn’t the cells, but the clock tolerances at rated memory speed.
So if memtest86 comes up clean, try underclocking the memory a bit and see if it helps. That’s what it took for me here, that time.
Duncan
Thanks for the idea, Duncan. Indeed I still seem to get some random segfaults that I can’t reproduce properly.
I’ll wait till the boost test completes (so likely tomorrow), then I’ll try reducing the speed myself as well; we’ll see.