Monitoring HP servers

This blog sometimes runs what amount to “columns”: long-term topics that keep re-emerging (no pun intended) from time to time. Since I came back to the US last July, as you can see, one of the big issues I fight with daily is HP servers.

Why is the company I’m working for using HP servers? Mostly because they didn’t have a resident system administrator before I came on board, and just recently they had hired an external consultant to set up new servers … the same consultant who set up my other nightmare, Apple OS X Server, so I’m not sure which of the two options I prefer.

Anyway, as you probably know if you follow my blog, I’ve been busy setting up Munin and Icinga to monitor the status of services and servers, and that has helped quite a bit over time. Unfortunately, monitoring HP servers is not easy. You probably remember I wrote a plugin so I could monitor them through IPMI; it worked nicely until I actually got Albert to expose the thresholds in the ipmi-sensors output, at which point it broke, because HP’s default thresholds are completely messed up and unusable, and it’s not possible to commit new thresholds.

After spending quite some time playing with this, I ended up with write access to Munin’s repositories (thanks, Steve!) and I can now gloat about, or rather worry about, having authored quite a few new Munin plugins (the second-generation FreeIPMI multigraph plugin is one example, but I also have a sysfs-based hwmon plugin that can pick up all the sensors in your system in one sweep, a new multigraph-capable Apache plugin, and a couple of SNMP plugins to add to the list). These actually make my work much easier, as they send me warnings when a problem happens without my having to worry about it too much, but of course they are not enough.

After finally being able to replace the RHEL5 (which had no current subscription) with CentOS 5, I’ve started looking into what tools HP makes available to us, and found out that there are mainly two I care about: one is hpacucli, which is also available in Gentoo’s tree, and the other is called hp-health and is basically a custom interface to the IPMI features of the server. The latter actually has a working, albeit not really polished, plugin in the Munin contrib repository (which I guess I’ll soon look at turning into a multigraph-capable one; I really like multigraph), and that’s how I ended up finding it.

At any rate, at that point I realized that I had not added one of the most important checks: the SMART status of the hard drives, originally left out because I couldn’t get smartctl installed. So I went and checked for it. The older servers are almost all running their drives as IDE (because that’s the firmware’s default… don’t ask), so those are a different story altogether; the newer servers running CentOS use an HP controller with SAS drives through the CCISS (block-layer) driver in the kernel, while the one running Gentoo Linux uses the newer, SCSI-layer driver. None of them can use smartctl directly; instead you need a special option, smartctl -d cciss,0, and you point it at either /dev/cciss/c0d0 or /dev/sda depending on which of the two kernel drivers you’re using. These drives don’t provide all the data that SATA drives do, but they provide enough for Munin’s hddtemp_smartctl, and they do provide a health status…
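Concretely, for the first physical drive behind the controller, that means invocations along these lines (the device node is the only thing that changes between the two drivers):

# CCISS block-layer driver (the CentOS 5 machines)
smartctl -H -d cciss,0 /dev/cciss/c0d0

# newer SCSI-layer driver (the Gentoo machine)
smartctl -H -d cciss,0 /dev/sda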

As far as Munin is concerned, your configuration would then be something like this in /etc/munin/plugin-conf.d/hddtemp_smartctl:

[hddtemp_smartctl]
user root
env.drives hd1 hd2
env.type_hd1 cciss,0
env.type_hd2 cciss,1
env.dev_hd1 cciss/c0d0
env.dev_hd2 cciss/c0d0

You will of course have to adjust it depending on how many drives you have and which driver you’re using.

But when I tried to use the default check_smart.pl script from the nagios-plugins package, I got two bad surprises: the first is that it tries to validate the device-type parameter that gets passed through to smartctl, refusing to work with a cciss type; the second is that it doesn’t recognize the status message that this particular driver prints. I was so pissed that, instead of trying to fix that plugin (which still comes with special handling for IDE-based hard drives!), I decided to write my own, using the Nagios::Plugin Perl module, and to release it under the MIT license.
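The contract such a plugin has to honor is tiny, for what it’s worth: print a single status line and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). My plugin is Perl and leans on Nagios::Plugin for the boilerplate; the following is only a minimal Ruby sketch of the same idea, with illustrative arguments and pattern matching, not the actual code:

#!/usr/bin/env ruby
# Minimal sketch of the Nagios plugin contract: one status line on
# stdout, exit code 0/1/2/3. Arguments and patterns are illustrative.
device, disknum = ARGV
output = `smartctl -H -d cciss,#{disknum} #{device} 2>&1`

if output =~ /PASSED|Health Status: OK/
  puts "SMART OK - #{device} (cciss,#{disknum}) reports healthy"
  exit 0
elsif output =~ /FAILED|FAILING/
  puts "SMART CRITICAL - #{device} (cciss,#{disknum}) reports a problem"
  exit 2
else
  puts "SMART UNKNOWN - could not parse smartctl output for #{device}"
  exit 3
end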

You can find my new plugin in my github repository where I think you’ll soon find more plugins, as I’ve had quite a few things to keep under control anyway. The next step is probably using the hp-health status to get a good/bad report, hopefully covering something that I don’t already get through standard IPMI.

The funny thing about HP’s utilities is that, for the most part, they just present data that is already available through the IPMI interface, but there are a few differences. For instance, the fan speed reported by IPMI is exposed in RPM, which is the usual way to express the speed of fans; the HP utility instead exposes it as a percentage of the maximum fan speed (so a fan that, say, tops out at 12000 RPM and is spinning at 3000 would read 25%). And that’s how their thresholds are expressed as well (and as I said, the thresholds for fan speed are completely messed up on my HP servers).

Oh well, with everything else going on lately, this will be enough for now.

Nagios, Icinga, and a security nightmare

If you’ve followed my blog in the past few weeks, you know I’ve been doing quite some work on both Munin and the Nagios packaging (I leave Icinga to prometheanfire!), as well as working closely with Munin upstream by feeding them patches. Yesterday I actually got access to the Munin contrib repository, so now I can help make sure that the plugins reach a state where they can actually be redistributed and packaged.

I also spent some time cleaning up what were once called nagios-nsca and nagios-nrpe (they are now just nsca and nrpe, since they work just fine with Icinga as well, and the nagios- prefix was never part of the upstream names anyway; kudos to Markos for noticing that I didn’t check correctly for revdeps, by the way). Both now have a minimal USE flag that you can turn on to avoid building their respective daemon, although you have to remember to enable minimal for nsca on the nodes, and for nrpe on the master. They also both come with new init scripts that are simplified and leverage the new functionality in OpenRC.
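In package.use terms (assuming the usual net-analyzer category) that boils down to one line per side, each going on the respective machine:

# on the monitored nodes: keep send_nsca, skip the NSCA daemon
net-analyzer/nsca minimal

# on the master: keep check_nrpe, skip the NRPE daemon
net-analyzer/nrpe minimal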

There is, though, something that is not letting me sleep well, and that’s besides the usual problems I have with sleeping. Let me try to explain.

Some of the Nagios/Icinga checks can’t be executed remotely as they are: things like S.M.A.R.T. monitoring obviously need to run on the box they are meant to monitor. So how do you fix this? You use NRPE, the Nagios Remote Plugin Executor. This is basically a daemon used to execute commands on the node (more or less the way Munin’s node works). Unfortunately, unlike Munin, neither Icinga proper nor NRPE allows you to choose on a per-plugin basis which user to run as (to make that possible, Munin runs its node as root).

Instead, everything is executed by the nagios user, and if you need to reach something that user can’t access, you can work around it with a setuid-root plugin (these are tied to the suid USE flag for nagios-plugins in Gentoo). But this, of course, only works for binaries, not scripts. And here’s the first problem: to check the S.M.A.R.T. status of an IDE drive, you can use the check_ide_smart tool, which reimplements the whole protocol… but to check the status of a SATA drive you have to use check_smart.pl, which relies on smartmontools to take care of it.

But how can the script access the disk? Well, it does it in the simplest way: it uses sudo. Of course this means that the nagios user has to have sudo access… Afraid that this would lead people to give unconditional sudo access to the nagios user, I decided to work around it by installing my own configuration file for sudo in the ebuild, making use of the new /etc/sudoers.d directory. This means that on a default install only the commands that are expected will be allowed for the nagios user. And this is good.
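As an illustration of what such a fragment looks like (the actual command list in the ebuild may differ, and the smartctl path here is an assumption):

# /etc/sudoers.d/nagios (illustrative): allow only the specific
# commands the plugins need, rather than unrestricted root
nagios ALL = (root) NOPASSWD: /usr/sbin/smartctl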

But sometimes the plugins themselves don’t seem to care about using sudo directly; instead they rely on being executed as a user with sufficient privileges. For this reason, the nrpe configuration allows you to prefix all commands with any command of your choice, the suggested one being… sudo! And the documentation suggests making sure that the user running nrpe does not have write access to the configuration directory to avoid security issues… you can understand that this is not the only bad idea to be found in there.
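For reference, this is the knob in question in nrpe.cfg (paths as in a typical install); I would leave the prefix commented out and rely on explicit sudo calls plus a tight sudoers file instead:

# nrpe.cfg: prefixing *every* command with sudo is asking for trouble
# command_prefix=/usr/bin/sudo
command[check_load]=/usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6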

Sigh, this stuff is a complete security nightmare, truly.

I’m in my network, monitoring!

While I was originally supposed to come here to Los Angeles to work as a firmware development engineer, I’ve ended up doing a bit more than I was called for… in particular, it seems I’ve been enlisted to work as a system/network administrator as well, which is not that bad to be honest, even though it still means I have to deal with a number of old RedHat and derivative systems.

As I said before, this is good because it means I can work on open-source projects, and on Gentoo maintenance, during work hours, as the monitoring is done with Munin, Gentoo and, lately, Icinga. The main issue is of course having to deal with so many different versions of RedHat (there is at least one RHEL3, a few RHEL4, a couple of RHEL5, almost all of them without current subscriptions, plus some CentOS 5 and, luckily, the new servers which run Gentoo), but there are others.

Last week I started looking into Icinga to monitor the status of services: while Munin is good for seeing how things move over time and for getting an idea of “what happened at that point”, it’s still not great if you just want to know “is everything okay right now, or not?”. I also find most Munin plugins simpler to handle than Nagios’s (which are what Icinga uses), and since I already want the data available in graphs, I might just as well forward the notifications. This of course does not apply to boolean checks, which would be pretty silly in Munin.

There is some documentation on the Munin website on how to set up Nagios notifications, and it works mostly flawlessly for Icinga too. The one difference is that you have to change the NSCA configuration, as Icinga uses a different command file path and a different user, which means you have to set up

nsca_user=icinga
nsca_group=icinga

command_file=/var/lib/icinga/rw/icinga.cmd
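
On the Munin side, following that same documentation, forwarding the notifications boils down to a contact definition in munin.conf along these lines (host name and file paths are illustrative):

contact.icinga.command /usr/bin/send_nsca -H monitor.example.com -c /etc/nagios/send_nsca.cfg
contact.icinga.always_send warning critical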

I’m probably going to make the init script accept a selectable configuration file and install two pairs of configuration files, one in /etc/icinga and the other in /etc/nagios, so that each user can choose which ones to use. This should make it easier to set up.

So while I don’t have much to say for now, and I have had little time to post in the past few days, my plan regarding Icinga and Munin consists primarily of cleaning up the nagios-plugins ebuild (right now it just dumps all the contrib scripts without caring about them at all, let alone about their dependencies), and writing documentation on the wiki about Icinga, the way I cleaned up the Munin documentation. Speaking of which, Debian decided to disable CGI in their packages as well, so now the default is to keep CGI support disabled unless required, and it’s provided “as is”, with no warranty that it ever works. I also have to finish setting up Munin’s async support, which certainly becomes useful at this point.

I’m also trying to fit in Ruby work, as well as the usual Tinderbox mangling, so… please bear with my lack of updates.

Monitoring a single server

If you follow my delicious you might have noticed some recently tagged content about Ruby and Gtk+. As you might guess, I’m going to resume working with Ruby and in particular I’m going to write a graphical application using Ruby-Gtk2.

The problem I’m trying to solve is related to the downtime I had: I cannot stay logged in over SSH with top open at all hours on my vserver to make sure everything is alright, and thus I ended up having some trouble because a script possibly went haywire (I’m not sure whether it went haywire before or after the move of the vserver to new hardware).

Since using Nagios would be a bit of an overkill, considering I have to monitor a single box and I don’t want to keep staring at something (my email included), I’ve decided that the solution is to write a desktop application that monitors the status of the box and notifies me right away when something is not going as it should. Now of course this is a very nice target, but a difficult one to achieve, starting with “how the heck do you get the data out of the box?”.

Luckily, for my High School final exam I presented a piece of software that was already a step toward the solution: ATMOSphere (yes, I know the site is lame and the project is long dead), a program to monitor and configure my router, a D-Link DSL-500 (Generation I) whose operating system was ATMOS, by GlobespanVirata (I still have the printed in-depth manuals for the complex CLI interface it had, over both serial and telnet). Together with the CLI protocol for setting up basic parameters, I used SNMP to read most parameters out of it. This is the reason why you might find my name connected to a library called libksnmp: that library was a KDE-like interface to the net-snmp library (which was, at least at the time, a mess to develop with), and I used it not only for ATMOSphere but also in KNetStat to access remote interfaces (like the one on my router). Since then I haven’t worked with SNMP at all, although I’m sure my current router supports it too.

Despite being called the (anything but) Simple Network Management Protocol, I’d expect SNMP to be used much more often for querying than for actual management, especially considering the bad excuse for an authentication system included in the first two versions (not that the one included in version 3 is much better). It’s also almost certainly a misnomer, since the OID approach is probably one of the worst I’ve ever seen in a protocol. But beside this, the server software is very much available (net-snmp), and nowadays there is a decent client library in Ruby too, which makes it possible to write monitoring software relatively quickly.
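To give an idea, a one-shot query with the Ruby library (the snmp gem) takes just a few lines; the host name and community string below are placeholders, and the symbolic names come from the standard MIBs the gem loads by default:

require 'rubygems'
require 'snmp'

# Poll a couple of standard values from the remote agent; host and
# community are placeholders for your own setup.
SNMP::Manager.open(:Host => 'vserver.example.com',
                   :Community => 'public') do |manager|
  response = manager.get(['sysUpTime.0', 'ifInOctets.1', 'ifOutOctets.1'])
  response.each_varbind do |vb|
    puts "#{vb.name}: #{vb.value}"
  end
end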

My idea was to write something that sits in my desktop tray and queries the server for its status at a given interval; the nice thing would be being notified as soon as there’s a problem, both by putting a big red icon in the tray and by showing a message through libnotify telling me what the problem is. This would allow me to know immediately if something went haywire. The problem is: how do you define “there’s a problem”? This is the part I’m trying to solve right now.

The SNMP specifications do allow for setting error thresholds, so you could just tell snmpd when to report that there’s a problem; that way it would not be the client but the monitored server that knows when to raise an alarm, which is very nice, since you only need to configure it on the server, and even if you change workstation you’ll keep the same parameters. Unfortunately, this has limited scope: on most routers and SoHo network equipment you won’t find much SNMP configuration at all. The D-Link ones, albeit supporting SNMP quite well, didn’t advertise it in the manual nor offer configuration options in the web pages; the 3Com I have now has some configuration for SNMP traps, and supports writing through SNMP (luckily, disabled by default). I guess I’ll have to add support for writing at least some parameters, so I can set up devices like these (which let you configure the alarms over SNMP). But for those that also lack write support, I suppose the only way is to add support for client-side rules that tell the monitoring application when to issue a warning. I guess that might be a further extension.

Right now I’m a bit stuck, because the version of Ruby-Gtk2 in Portage does not support GtkBuilder, which makes writing the interface quite a bit of an issue; once the new version is in, I’ll certainly be working on something to apply there. In the meantime, I’m open to suggestions for other monitoring applications that might save me from writing my own, or to ideas on how I could approach the problems that will present themselves. I think at the very least I’ll add some drop-down widget, like the one for the world clock in GNOME (where the timezones are shown), with graphs of the interfaces’ in/out bandwidth use (which would be nice, as I could resume monitoring my router too).

Okay, for now I suppose I’ll stop here; I’ll wait for the dependencies I need to land in Portage, and maybe in the meantime someone will find me something better to do, or a piece of software that does what I’m looking for.