As I wrote yesterday, I’ve been doing system and network administration work here in LA as well, and I’ve set up Munin and Icinga to warn me when something requires maintenance.
Now, some of the first problems that Munin forwarded to Icinga we already knew about (in another post I wrote of how the CMOS battery ran out on two of the servers), but one was something that had bothered me before as well: one of the boxes only has one CPU on board, and for the missing one it reports a value of 0 instead of N/A.
So I decided to look into updating the firmware of the DL140 G3 to see if it would help us at all; the original firmware on the IPMI device was 2.10, while the latest one available is 2.21. Neither supports updating the firmware through the web interface. The firmware download, even when selecting the Red Hat Enterprise Linux option, is a Windows EXE file (not a self-extracting archive, which you could extract from Linux, but their usual full-fledged setup software that extracts into C:\SWSetup). When you extract it, you’re presented with instructions on how to build a USB key which you can then use to update the firmware via FreeDOS…
You can guess I wasn’t amused.
After searching around a bit more I found out that there is a way to update this over the network. It’s described in HP’s advanced iLO usage guide, and seems to work fine, but it also requires another step to be taken in Windows (or FreeDOS): you have to use the ROMPAQ.EXE utility to decompress the compressed firmware image.
*I wonder, why does HP provide you with two copies of the compressed firmware image, for a grand total of 3MB, instead of only one of the uncompressed one (2MB)? I suppose the origin of the compressed image is to be found in the 1.44MB floppy disk size limitation, but nowadays you don’t use floppies… oh well.*
After you have the uncompressed image, you have to set up a TFTP server… which luckily I already had lying around from when I updated the firmware of the APC powerstrips discussed in one of the posts linked above. So I just added the IPMI firmware image, and moved on to the next step.
The next step consists of connecting to the box via telnet and issuing two commands:
cd map1/firmware1
followed by
load -source //$serverip/$filename -oemhpfiletype csr
… the file is downloaded via TFTP and the BMC rebooted. Afterwards you have to clear out FreeIPMI’s SDR cache, as ipmi-sensors won’t work otherwise.
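Since I have to repeat this on every remaining box, here is a minimal sketch of how the two commands could be generated per server. The function name, the address and the file name are all made up for illustration; only the two command strings come from the procedure above.

```python
def build_update_commands(server_ip: str, filename: str) -> list[str]:
    """Return the two commands typed into the BMC's telnet session, in order."""
    return [
        # Move to the firmware target on the BMC.
        "cd map1/firmware1",
        # Ask the BMC to fetch the uncompressed image from the TFTP server.
        f"load -source //{server_ip}/{filename} -oemhpfiletype csr",
    ]


# Hypothetical TFTP server address and image name, for illustration only.
for command in build_update_commands("192.0.2.10", "firmware.bin"):
    print(command)
```

As for the SDR cache, as far as I know `ipmi-sensors --flush-cache` is the FreeIPMI way to clear it, but check the man page of your version.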
This did fix the critical notification I was receiving… to a point. First of all, the fan speed still has bogus thresholds, as it reports the upper limits instead of the lower ones (and I’m not sure if it’s a bug in FreeIPMI or one in the firmware at this point). Second of all, the way it fixed the misreported CPU thermal sensor is by… not reporting any temperature from either thermal sensor! Now both CPU temperatures are gone and only the ambient temperature is available. D’oh!
Another funky issue is that I’m still fighting to get Munin to tell Icinga that “everything’s okay”: the way Munin calls send_nsca is tied to the limits, so if no limits are defined, it seems like it simply doesn’t report anything at all. This is something else I have to fix this week.
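For reference, this is roughly what a passive “everything’s okay” result looks like when it is handed to send_nsca by hand: one tab-separated line per service, with the state as a Nagios-style return code. It is only a sketch of the input format, not of what Munin does internally; the function, host and service names are hypothetical.

```python
# Nagios/Icinga return codes: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
VALID_STATES = (0, 1, 2, 3)


def nsca_line(host: str, service: str, state: int, message: str) -> str:
    """Format one passive service check result as send_nsca reads it on stdin."""
    if state not in VALID_STATES:
        raise ValueError(f"invalid state: {state}")
    # Fields are tab-separated and the record ends with a newline.
    return f"{host}\t{service}\t{state}\t{message}\n"


# Hypothetical host and service names, for illustration only.
print(nsca_line("dl140-1", "IPMI sensors", 0, "OK: all values within limits"), end="")
```

The resulting line would be piped into something like `send_nsca -H <icinga host> -c <config file>`.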
Now back to doing the firmware updates on the remaining boxes…
Update: turns out HP’s updates are worse than the original firmware in some ways. Not only are the CPU thermal diodes no longer reported, but the voltages lost their thresholds altogether! The end result is that now it says that it’s all a-ok, even if the 3V battery is reported at 0.04V! Which basically means that I have to set my own limits on things, but at least it should work as intended afterwards.
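To make the point about limits concrete, here is a small sketch of how a reading is judged against Munin-style `min:max` ranges, where either side may be left empty. The threshold values are invented for illustration, not the ones I will end up using, and the sketch only accepts ranges written with a colon.

```python
from typing import Optional


def in_range(value: float, spec: str) -> bool:
    """Check a value against a 'min:max' range; either bound may be empty."""
    if ":" not in spec:
        raise ValueError("this sketch only handles ranges written with a colon")
    low, _, high = spec.partition(":")
    if low and value < float(low):
        return False
    if high and value > float(high):
        return False
    return True


def classify(value: float, warning: Optional[str] = None,
             critical: Optional[str] = None) -> str:
    """Return OK, WARNING or CRITICAL; without limits everything is OK."""
    if critical and not in_range(value, critical):
        return "CRITICAL"
    if warning and not in_range(value, warning):
        return "WARNING"
    return "OK"


# With no thresholds, the dead battery still looks fine.
print(classify(0.04))
# With (hypothetical) limits set by hand, it is flagged.
print(classify(0.04, warning="2.8:3.4", critical="2.5:3.6"))
```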
Oh and the DL160 G6? First of all, this time the firmware update has a web interface… to tell it which file to request from which TFTP server. Too bad that all the firmware updates that I can run on my systems require the bootcode to be updated as well, which means we’ll have to schedule some maintenance time when I come back from VDDs.