Sometimes this blog has something like “columns” for long-term topics that keep re-emerging (no pun intended) from time to time. Since I came back to the US last July you can see that one of the big issues I fight with daily is HP servers.
Why is the company I’m working for using HP servers? Mostly because they didn’t have a resident system administrator before I came on board, and just recently they hired an external consultant to set up new servers … the one who set up my nightmare: Apple OS X Server so I’m not sure which of the two options I prefer.
Anyway, as you probably know if you follow my blog, I’ve been busy setting up Munin and Icinga to monitor the status of services and servers — and that helped quite a bit over time. Unfortunately, monitoring HP servers is not easy. You probably remember I wrote a plugin so I could monitor them through IPMI — it worked nicely until I actually got Albert to expose the thresholds in the
ipmi-sensors output, then it broke because HP’s default thresholds are totally messed up and unusable, and it’s not possible to commit new thresholds.
After spending quite some time playing with this, I ended up with write access to Munin’s repositories (thanks, Steve!) and I can now
gloat be worried about having authored quite a few new Munin plugins (the second generation FreeIPMI multigraph plugin is an example, but I also have a sysfs-based hwmon plugin that can get all the sensors in your system in one sweep, a new multigraph-capable Apache plugin, and a couple of SNMP plugins to add to the list). These actually make my work much easier, as they send me warnings when a problem happens without having to worry about it too much, but of course are not enough.
After finally being able to replace the RHEL5 (without a current subscription) with CentOS 5, I’ve started looking in what tools HP makes available to us — and found out that there are mainly two that I care about: one is hpacucli, which is also available in Gentoo’s tree, and the other is called
hp-health and is basically a custom interface to the IPMI features of the server. The latter actually has a working, albeit not really polished, plugin in the Munin contrib repository – which I guess I’ll soon look to transform into a multigraph capable one; I really like multigraph – and that’s how I ended up finding it.
At any rate at that point I realized that I did not add one of the most important checks: the SMART status of the harddrives — originally because I couldn’t get smartctl installed. So I went and checked for it — the older servers are almost all running as IDE (because that’s the firmware’s default.. don’t ask), so those are a different story altogether; the newer servers running CentOS are using an HP controller with SAS drives, using the CCISS (block-layer) driver from the kernel, while one is running Gentoo Linux, and uses the newer, SCSI-layer driver. All of them can’t use
smartctl directly, but they have to use a special command:
smartctl -d cciss,0 — and then either point it to
/dev/sda depending on how which of the two kernel drivers you’re using. They don’t provide all the data that they provide for SATA drives, but they provide enough for Munin’s
hddtemp_smartctl and they do provide an health status…
For what concerns Munin, your configuration would then be something like this in
[hddtemp_smartctl] user root env.drives hd1 hd2 env.type_hd1 cciss,0 env.type_hd2 cciss,1 env.dev_hd1 cciss/c0d0 env.dev_hd2 cciss/c0d0
Depending on how many drives you have and which driver you’re using you will have to edit it of course.
But when I tried to use the default
check_smart.pl script from the nagios-plugins package I had two bad surprises: the first is that they try to validate the parameter passed to the plugin to identify the device type to smartctl, refusing to work for a
cciss type, and the other that it didn’t consider the status message that is printed by this particular driver. I was so pissed, that instead of trying to fix that plugin – which still comes with special handling for IDE-based harddrives! – I decided to write my own, using the Nagios::Plugin Perl module, and releasing it under the MIT license.
You can find my new plugin in my github repository where I think you’ll soon find more plugins — as I’ve had a few things to keep under control anyway. The next step is probably using the
hp-health status to get a good/bad report, hopefully for something that I don’t get already through standard IPMI.
The funny thing about HP’s utilities is that they for the most part just have to present data that is already available from the IPMI interface, but there are a few differences. For instance, the fan speed reported by IPMI is exposed in RPMs — which is the usual way to expose the speed of fans. But on the HP utility, fan speed is actually exposed as a percentage of the maximum fan speed. And that’s how their thresholds are exposed as well (as I said, the thresholds for fan speed are completely messed up on my HP servers).
Oh well, anything else can happen lately, this would be enough for now.