Monitoring HP servers

Sometimes this blog has something like “columns” for long-term topics that keep re-emerging (no pun intended) from time to time. Since I came back to the US last July you can see that one of the big issues I fight with daily is HP servers.

Why is the company I’m working for using HP servers? Mostly because they didn’t have a resident system administrator before I came on board, and just recently they hired an external consultant to set up new servers … the one who set up my nightmare, Apple OS X Server, so I’m not sure which of the two options I prefer.

Anyway, as you probably know if you follow my blog, I’ve been busy setting up Munin and Icinga to monitor the status of services and servers — and that helped quite a bit over time. Unfortunately, monitoring HP servers is not easy. You probably remember I wrote a plugin so I could monitor them through IPMI — it worked nicely until I actually got Albert to expose the thresholds in the ipmi-sensors output, then it broke because HP’s default thresholds are totally messed up and unusable, and it’s not possible to commit new thresholds.

After spending quite some time playing with this, I ended up with write access to Munin’s repositories (thanks, Steve!) and I can now gloat (or rather, be worried) about having authored quite a few new Munin plugins: the second-generation FreeIPMI multigraph plugin is an example, but I also have a sysfs-based hwmon plugin that can get all the sensors in your system in one sweep, a new multigraph-capable Apache plugin, and a couple of SNMP plugins to add to the list. These actually make my work much easier, as they send me warnings when a problem happens without my having to worry about it too much, but of course they are not enough.

After finally being able to replace the RHEL5 (without a current subscription) with CentOS 5, I’ve started looking into what tools HP makes available to us, and found out that there are mainly two that I care about: one is hpacucli, which is also available in Gentoo’s tree, and the other is called hp-health and is basically a custom interface to the IPMI features of the server. The latter actually has a working, albeit not really polished, plugin in the Munin contrib repository (which I guess I’ll soon look to transform into a multigraph-capable one; I really like multigraph), and that’s how I ended up finding it.

At any rate, at that point I realized that I had not added one of the most important checks: the SMART status of the hard drives, originally because I couldn’t get smartctl installed. So I went and checked for it. The older servers are almost all running their drives as IDE (because that’s the firmware’s default… don’t ask), so those are a different story altogether. The newer servers running CentOS use an HP controller with SAS drives through the CCISS (block-layer) driver from the kernel, while one runs Gentoo Linux and uses the newer, SCSI-layer driver. None of them can use smartctl directly; they need a special device type, smartctl -d cciss,0, pointed at either /dev/cciss/c0d0 or /dev/sda depending on which of the two kernel drivers you’re using. They don’t provide all the data you would get for SATA drives, but they provide enough for Munin’s hddtemp_smartctl, and they do provide a health status…
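To make the invocation concrete, here is a minimal Python sketch that builds the smartctl command line for a given physical drive. The function name and the "cciss"/"scsi" labels are my own invention for illustration, and it assumes the first controller only (c0d0 or sda), as in the text above.

```python
def smartctl_argv(index, driver="cciss", extra=("-H",)):
    """Build a smartctl command line for physical drive `index`.

    `driver` is a hypothetical label: "cciss" for the block-layer
    driver, "scsi" for the newer SCSI-layer one.
    """
    # Assumption: only the first controller is in use, so the device
    # node is fixed and the drive is selected through "-d cciss,N".
    device = {"cciss": "/dev/cciss/c0d0", "scsi": "/dev/sda"}[driver]
    return ["smartctl", *extra, "-d", f"cciss,{index}", device]


if __name__ == "__main__":
    print(" ".join(smartctl_argv(0)))
    print(" ".join(smartctl_argv(1, driver="scsi")))
```

Note that the drive is picked by the number after cciss, not by the device node, which stays the same for every drive behind the controller.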

As far as Munin is concerned, your configuration would then be something like this in /etc/munin/plugin-conf.d/hddtemp_smartctl:

[hddtemp_smartctl]
user root
env.drives hd1 hd2
env.type_hd1 cciss,0
env.type_hd2 cciss,1
env.dev_hd1 cciss/c0d0
env.dev_hd2 cciss/c0d0

Depending on how many drives you have and which driver you’re using, you will have to edit it, of course.
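If you have more than a couple of drives, the snippet above is easy enough to generate. This is only a sketch in Python, assuming all the drives sit behind the same controller device (which is why env.dev is identical for each of them); the function name is made up.

```python
def hddtemp_conf(n_drives, device="cciss/c0d0"):
    """Render a plugin-conf.d snippet for hddtemp_smartctl."""
    names = [f"hd{i + 1}" for i in range(n_drives)]
    lines = ["[hddtemp_smartctl]", "user root",
             "env.drives " + " ".join(names)]
    # The drive is selected by the cciss,N type, counted from zero...
    lines += [f"env.type_{name} cciss,{i}" for i, name in enumerate(names)]
    # ...while the device node is the same for every drive.
    lines += [f"env.dev_{name} {device}" for name in names]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(hddtemp_conf(2), end="")
```

For the SCSI-layer driver you would pass device="sda" instead.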

But when I tried to use the default check_smart.pl script from the nagios-plugins package I had two bad surprises: the first is that it tries to validate the parameter that identifies the device type to smartctl, refusing to work for a cciss type, and the second is that it doesn’t consider the status message that is printed by this particular driver. I was so pissed that instead of trying to fix that plugin (which still comes with special handling for IDE-based hard drives!) I decided to write my own, using the Nagios::Plugin Perl module, and to release it under the MIT license.
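To show what “considering the status message” means, here is a rough Python sketch of the parsing step. This is not the Perl plugin mentioned above, just an illustration; the two status lines it looks for are what I understand smartctl -H to print for SCSI/SAS and for ATA drives respectively, so treat them as assumptions and check them against your own output.

```python
import re

# Standard Nagios plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3

# Assumed output formats of "smartctl -H": (pattern, value meaning healthy).
PATTERNS = [
    (r"^SMART Health Status:\s*(.+)$", "OK"),  # SCSI/SAS, cciss
    (r"^SMART overall-health self-assessment test result:\s*(.+)$",
     "PASSED"),  # ATA/SATA
]


def smart_health(output):
    """Return (exit_code, status_text) for captured smartctl -H output."""
    for pattern, good in PATTERNS:
        match = re.search(pattern, output, re.MULTILINE)
        if match:
            status = match.group(1).strip()
            return (OK if status == good else CRITICAL), status
    return UNKNOWN, "no health status line found"


if __name__ == "__main__":
    print(smart_health("SMART Health Status: OK\n"))
```

Anything that is not the known-good value is treated as critical, and a missing status line is reported as unknown rather than silently passing.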

You can find my new plugin in my GitHub repository, where I think you’ll soon find more plugins, as I’ve had a few things to keep under control anyway. The next step is probably using the hp-health status to get a good/bad report, hopefully for something that I don’t get already through standard IPMI.

The funny thing about HP’s utilities is that they for the most part just have to present data that is already available from the IPMI interface, but there are a few differences. For instance, the fan speed reported by IPMI is exposed in RPMs — which is the usual way to expose the speed of fans. But on the HP utility, fan speed is actually exposed as a percentage of the maximum fan speed. And that’s how their thresholds are exposed as well (as I said, the thresholds for fan speed are completely messed up on my HP servers).

Oh well, with everything else that can happen lately, this will be enough for now.

Nagios, Icinga, and a security nightmare

If you’ve followed my blog in the past few weeks, you know that I’ve been doing quite some work between Munin and the Nagios packaging (I leave Icinga to prometheanfire!), as well as working closely with Munin upstream by feeding them patches. Yesterday I actually got access to the Munin contrib repository, so now I can help make sure that the plugins reach a state where they can actually be redistributed and packaged.

I also spent some time cleaning up what was once called nagios-nsca and nagios-nrpe (which are now just nsca and nrpe, since they work just fine with Icinga as well, and the nagios- part was never part of the upstream names anyway; kudos to Markos for noticing I didn’t check correctly for revdeps, by the way). You now have minimal USE flags that you can turn on to avoid building their respective daemons, although you have to remember to enable minimal for nsca on the nodes, and for nrpe on the master. They also both come with new init scripts that are simplified and leverage the new functionality in OpenRC.

There is something, though, that is keeping me from sleeping well, and that’s beside the usual problems I have with sleeping. Let me try to explain.

Some of the Nagios/Icinga tests obviously can’t be executed remotely as they are: things like S.M.A.R.T. monitoring need to run on the box they monitor. So how do you fix this? You use nrpe, the Nagios Remote Plugin Executor. This is basically a daemon that executes commands on the node (more or less the way Munin’s node works). Unfortunately, unlike Munin, neither Icinga proper nor NRPE allows you to choose on a per-plugin basis which user to run as (to do so, Munin has its node running as root).

Instead, everything is executed by the nagios user, and if you need to access something that the user can’t access, you can work around it by using a setuid-root plugin (these are tied to the suid USE flag for nagios-plugins in Gentoo). But this, of course, only works for binaries, not scripts. And here’s the first problem: to check the S.M.A.R.T. status of an IDE drive, you can use the check_ide_smart tool that reimplements the whole protocol… to check the status of a SATA drive you should use check_smart.pl, which uses SmartMonTools to take care of it.

But how can the script access the disk? Well, it does it in the simplest way: it uses sudo. Of course this means that the nagios user has to have sudo access… afraid that this would get people to give unconditional sudo access to the nagios user, I decided to work around it by installing my own configuration file for sudo in the ebuild, making use of the new /etc/sudoers.d folder, which means that on a default install, just the commands that are expected will be allowed for the nagios user. And this is good.
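To give an idea of what such a restricted file looks like, here is a small Python sketch that renders a sudoers line for a fixed list of commands. It is not the file the ebuild installs; the smartctl path and the helper name are assumptions for illustration only.

```python
def sudoers_entry(user, commands, run_as="root"):
    """Render one sudoers line allowing `user` to run only `commands`."""
    for command in commands:
        # sudoers wants fully qualified paths; refuse anything else.
        if not command.startswith("/"):
            raise ValueError(f"not an absolute path: {command!r}")
    return f"{user} ALL=({run_as}) NOPASSWD: {', '.join(commands)}\n"


if __name__ == "__main__":
    # Hypothetical path; check where smartctl lives on your system.
    print(sudoers_entry("nagios", ["/usr/sbin/smartctl"]), end="")
```

Whatever you end up writing into /etc/sudoers.d, it is worth validating it with visudo -c before relying on it.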

But sometimes the plugins themselves don’t bother using sudo directly; instead they rely on being executed by a user that has enough privileges. For this reason, the nrpe configuration allows you to prefix all commands with any command of your choice, with the default being… sudo! And their documentation suggests making sure that the user running nrpe does not have write access to the directory to avoid security issues… you can understand that it’s not the only bad idea you could have, there.

Sigh, this stuff is a complete security nightmare, truly.

Updating HP iLO 2.x

As I wrote yesterday, I’ve been doing system and network administration work here in LA as well, and I’ve set up Munin and Icinga to warn me when something requires maintenance.

Now, some of the first probes that Munin forwarded to Icinga we already knew about (in another post I wrote of how the CMOS battery ran out on two of the servers), but one was something that bothered me before as well: one of the boxes only has one CPU on board and it reports a value of 0 instead of N/A.

So I decided to look into updating the firmware of the DL140 G3 and see if it would help us at all; the original firmware on the IPMI device was 2.10 while the latest one available is 2.21. Neither supports firmware update via HTML. The firmware download, even when selecting the RedHat Enterprise Linux option, is a Windows EXE file (not an auto-extract archive, which you can extract from Linux, but their usual full-fledged setup software that extracts into C:\SWSetup). When you extract it, you’re presented with instructions on how to build a USB key which you can then use to update the firmware via FreeDOS…

You can guess I wasn’t amused.

After searching around a bit more I found out that there is a way to update this over the network. It’s described in HP’s advanced iLO usage guide, and seems to work fine, but it also requires another step to be taken in Windows (or FreeDOS): you have to use the ROMPAQ.EXE utility to decompress the compressed firmware image.

*I wonder, why does HP provide you with two copies of the compressed firmware image, for a grand total of 3MB, instead of only one of the uncompressed one (2MB)? I suppose the origin of the compressed image is to be found in the 1.44MB floppy disk size limitation, but nowadays you don’t use floppies… oh well.*

After you have the uncompressed image, you have to set up a TFTP server… which luckily I already had lying around from when I updated the firmware of the APC powerstrips discussed in one of the posts linked above. So I just added the IPMI firmware image, and moved on to the next step.

The next step consists of connecting via telnet to the box and issuing two commands: cd map1/firmware1 followed by load -source //$serverip/$filename -oemhpfiletype csr … the file is downloaded via TFTP and the BMC rebooted. Afterwards you have to clear out the SDR cache of FreeIPMI as ipmi-sensors wouldn’t work otherwise.
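If you have to repeat this on a handful of boxes, it helps to generate the two commands rather than retype them. A minimal Python sketch follows; the function name, the address and the file name in the example are placeholders, and the --flush-cache option for ipmi-sensors is what I believe FreeIPMI provides for clearing the SDR cache, so double check it on your version.

```python
def ilo_firmware_commands(server_ip, filename):
    """Commands to type in the BMC telnet session, in order."""
    return [
        "cd map1/firmware1",
        f"load -source //{server_ip}/{filename} -oemhpfiletype csr",
    ]


# Assumption: run on the monitoring host once the BMC has rebooted.
FLUSH_SDR_CACHE = ["ipmi-sensors", "--flush-cache"]


if __name__ == "__main__":
    # Placeholder TFTP server address and image name.
    for line in ilo_firmware_commands("192.0.2.10", "firmware.bin"):
        print(line)
    print(" ".join(FLUSH_SDR_CACHE))
```

The sketch only prints the commands; it deliberately does not open the telnet session itself.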

This did fix the critical notification I was receiving… to a point. First of all, the fan speed still has bogus thresholds, as it reports the upper limits instead of the lower ones (and I’m not sure whether it’s a bug in FreeIPMI or one in the firmware at this point). Second, the way it fixed the misreported CPU thermal sensor is by… not reporting any temperature from either thermal sensor! Now both CPU temperatures are gone and only ambient temperature is available. D’oh!

Another funky issue is that I’m still fighting to get Munin to tell Icinga that “everything’s okay”: the way Munin contacts send_nsca is tied to the limits, so if no limits are present, it seems like it simply doesn’t report anything at all. This is something else I have to fix this week.
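For reference, what send_nsca expects on its standard input is one tab-separated line per passive check result: host, service, return code and plugin output. Here is a small Python sketch of that framing, including the explicit “okay” case; the host and service names in the example are made up.

```python
def nsca_line(host, service, code, message):
    """Format one passive service check result for send_nsca's stdin."""
    if code not in (0, 1, 2, 3):  # OK, WARNING, CRITICAL, UNKNOWN
        raise ValueError(f"invalid return code: {code!r}")
    # Tabs and newlines inside the message would break the framing.
    clean = " ".join(message.split())
    return f"{host}\t{service}\t{code}\t{clean}\n"


if __name__ == "__main__":
    # Hypothetical host and service names.
    print(nsca_line("server1", "Fan speed", 0, "everything's okay"), end="")
```

The point of the example is the code 0 line: Icinga only learns that a service is fine if somebody actually sends it.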

Now back to doing the firmware updates on the remaining boxes…

Update: turns out HP’s updates are worse than the original firmware in some ways. Not only are the CPU thermal diodes no longer represented, but the voltages lost their thresholds altogether! The end result is that it now says everything is a-ok, even if the 3V battery is reported at 0.04V! This basically means that I have to set my own limits on things, but at least it should work as intended afterwards.
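Setting my own limits means giving Munin a min:max range for each field, where either side can be left empty. As a sketch of the check this implies, here is a bit of Python; it only handles the forms with a colon, and the 2.7:3.6 range for the battery is a number I made up for the example, not something taken from HP.

```python
def parse_range(spec):
    """Parse a "min:max", "min:" or ":max" range into (low, high)."""
    low, sep, high = spec.partition(":")
    if not sep:
        raise ValueError(f"expected min:max, got {spec!r}")
    return (float(low) if low else None,
            float(high) if high else None)


def outside(value, spec):
    """True if value falls outside the range, i.e. should raise an alert."""
    low, high = parse_range(spec)
    return ((low is not None and value < low)
            or (high is not None and value > high))


if __name__ == "__main__":
    # Hypothetical limits for the 3V battery.
    print(outside(0.04, "2.7:3.6"))
```

With a range like that, the 0.04V reading is flagged instead of being reported as fine.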

Oh and the DL160 G6? First of all, this time the firmware update has a web interface… to tell it which file to request from which TFTP server. Too bad that all the firmware updates that I can run on my systems require the bootcode to be updated as well, which means we’ll have to schedule some maintenance time when I come back from VDDs.

I’m in my network, monitoring!

While I was originally supposed to come here to Los Angeles to work as a firmware developer engineer, I’ve ended up doing a bit more than I was called for… in particular it seems like I’ve been enlisted to work as a system/network administrator as well, which is not that bad to be honest, even though it still means that I have to deal with a number of old RedHat and derivative systems.

As I said before, this is good because it means that I can work on open-source projects, and Gentoo maintenance, during work hours, as the monitoring is done with Munin, Gentoo and, lately, Icinga. The main issue is of course having to deal with so many different versions of RedHat (there is at least one RHEL3, a few RHEL4, a couple of RHEL5, almost all of them without subscriptions, some CentOS 5, plus the new servers that are Gentoo, luckily), but there are others.

Last week I started looking into Icinga to monitor the status of services: while Munin is good for knowing how things move over time and for getting an idea of “what happened at that point”, it’s still not extremely good if you just want to know “is everything okay now or not?”. I also find most Munin plugins simpler to handle than Nagios’s (which are what Icinga would be using), and since I already want the data available on graphs, I might just as well forward the notifications. This of course does not apply to boolean checks, which are pretty silly on Munin.

There is some documentation on the Munin website on how to set up Nagios notifications, and it mostly works flawlessly for Icinga. The one difference is that you have to change the NSCA configuration, as Icinga uses a different command file path and a different user, which means you have to set up:

nsca_user=icinga
nsca_group=icinga

command_file=/var/lib/icinga/rw/icinga.cmd

I’m probably going to make the init script have a selectable configuration file and install two pairs of configuration files, one in /etc/icinga and the other in /etc/nagios, so that each user can choose which ones to use. This should make it easier to set up.

So while I don’t have much to say for now, and I have had little time to post about this in the past few days, my plan for Icinga and Munin is primarily to clean up the nagios-plugins ebuild (right now it just dumps all the contrib scripts without caring about them at all, let alone about their dependencies), and to write documentation on the wiki about Icinga the way I cleaned up the one about Munin. Speaking of which, Debian decided to disable CGI in their packages as well, so now the default is to keep CGI support disabled unless required, and it’s provided “as is”, without warranties it ever works. I also have to finish setting up the Munin async support, which certainly becomes useful at this point.

I’m also trying to fit in Ruby work as well as the usual Tinderbox mangling so … please bear with my lack of updates.