Yes, I know I’m starting to get boring.
Today I spent at least half of my work day working on Munin plugins to effectively monitor some of the equipment we currently have at our co-location. This boils down to two metered APC power strips, or PDUs (let’s use their term, silly as it might sound). I think it’s worth noting the difference: APC provides switched and metered PDUs; the former should provide per-plug load data and let you power each plug on and off individually; the latter (what we have here) are much cheaper, do not allow you to turn plugs on and off, and simply give you a reading of the load per phase. Given that our co-location has only single-phase power, we only get one reading per strip, which is still okay: it gives us more information than we had before, at least.
Now, there are a few funny things with these strips: they have a network interface, which is cool, but they don’t use DHCP by default! You either have to set them up through the serial interface (which obviously is still a real serial port, not a USB adapter, and my laptop doesn’t have any serial port), or use a piece of Windows software (which is actually written in Java and spends 98% of the install time copying an extra install of the JRE to the drive), or finally note down the MAC address when you install them, then poison a system’s ARP table to “fake” an IP for the strip, sealing the deal by sending a 113-byte ICMP packet to the strip via ping… no, there is no use for a watermelon or a chimp, sorry Luca.
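For reference, the ARP trick boils down to two commands, roughly as sketched below. This assumes a Linux host with the net-tools arp and iputils ping; the addresses are placeholders from the documentation ranges, the helper name is made up for illustration, and treating the 113 bytes as the ping payload size (-s) is my assumption. The script only prints the commands, since actually running them needs root.

```python
import shlex


def apc_bootstrap_commands(ip, mac):
    """Return the two commands used to hand an address to the strip.

    Assumes Linux net-tools `arp` and iputils `ping`; both need root.
    """
    return [
        # Static ARP entry: map the address we want to the strip's MAC.
        ["arp", "-s", ip, mac],
        # One ping with a 113-byte payload (assumed meaning of "113 bytes").
        ["ping", "-c", "1", "-s", "113", ip],
    ]


if __name__ == "__main__":
    # Placeholder values, not real equipment.
    for cmd in apc_bootstrap_commands("192.0.2.10", "00:00:5e:00:53:01"):
        print(shlex.join(cmd))
```

Once the strip answers on the new address, the static ARP entry can be dropped again.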
After finally completing the IP settings, I had to find a way to get the data out; the strips support either SNMPv1 or SNMPv3. I discarded the former simply because it’s extremely insecure and I’d rather not even have that around, so I set up a user for munin. Next problem? snmpwalk did not report any useful data. The reason is actually quite simple: it doesn’t know which OIDs to probe for. Download the MIB data from APC and install it in the system, and it’s much happier.
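A rough sketch of what the working invocation might look like, built as an argument list. The host name, the passphrases, and the choice of SHA and AES are assumptions for illustration (use whatever the strip is actually configured for); PowerNet-MIB is the module name of APC’s MIB, and .1.3.6.1.4.1.318 is APC’s subtree under enterprises. Passing the subtree explicitly matters because, without an OID argument, net-snmp’s snmpwalk only walks mib-2.

```python
import shlex

APC_ENTERPRISE_OID = ".1.3.6.1.4.1.318"  # APC's subtree under "enterprises"


def snmpwalk_command(host, user, auth_pass, priv_pass):
    """Build an SNMPv3 snmpwalk invocation that also loads APC's MIB."""
    return [
        "snmpwalk",
        "-v", "3",
        "-l", "authPriv",               # require authentication and encryption
        "-u", user,
        "-a", "SHA", "-A", auth_pass,   # assumed auth protocol
        "-x", "AES", "-X", priv_pass,   # assumed privacy protocol
        "-m", "+PowerNet-MIB",          # load APC's MIB on top of the defaults
        host,
        APC_ENTERPRISE_OID,             # walk the vendor subtree explicitly
    ]


if __name__ == "__main__":
    # Hypothetical host and credentials.
    cmd = snmpwalk_command("pdu1.example.com", "munin", "authsecret", "privsecret")
    print(shlex.join(cmd))
```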
Then I had to write a plugin for it, which wasn’t too bad; the data is simple. Too bad I couldn’t find a way to get, through SNMP, the upper limit of current drain on the strip. It did report the configured (default) limits for near-overload and overload, which makes it very easy to set them up in Munin. Unfortunately, only after writing the plugin did I find out that the Munin contrib repository already had not one but two plugins trying to do the same. Neither is very good at it, though: neither supported Munin’s SNMP framework, and one had a very unclear licensing situation (which is unfortunately common in the contrib repository) and used sh and net-snmp’s command-line utilities to access the strip.
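To give an idea of how little such a plugin has to print, here is a minimal sketch of the Munin plugin protocol in Python. It is not the plugin described above (that one uses Munin’s own SNMP framework); the field name, graph labels, and hard-coded readings are placeholders, and the assumption that the strip reports load in tenths of an ampere should be checked against the MIB.

```python
import sys


def config_lines(near_overload_amps, overload_amps):
    """Lines a Munin plugin prints when called with the 'config' argument."""
    return [
        "graph_title PDU load",
        "graph_vlabel A",
        "graph_category sensors",
        "load.label Load",
        # ":max" form: warn or go critical once the value exceeds the limit.
        f"load.warning :{near_overload_amps}",
        f"load.critical :{overload_amps}",
    ]


def fetch_lines(raw_tenths_of_amp):
    """Lines printed on a normal run; raw value assumed to be in tenths of an ampere."""
    return [f"load.value {raw_tenths_of_amp / 10:.1f}"]


def main(argv):
    # Placeholder readings; a real plugin would query the strip over SNMP.
    if len(argv) > 1 and argv[1] == "config":
        lines = config_lines(12, 16)
    else:
        lines = fetch_lines(37)
    print("\n".join(lines))


if __name__ == "__main__":
    main(sys.argv)
```

Feeding the strip’s own near-overload and overload limits into the warning and critical fields is what lets Munin raise alerts without any extra configuration on the master.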
So after adding my plugin and removing the two bad ones, I also looked into cleaning up the contrib tree a little bit. It’s far from perfect: there are still miscategorized plugins and duplicates, and others (such as one of the net-p2p/transmission monitors) that rely on external script files instead of being written as a single file. But at least I was able to remove and recategorize enough of them that it starts to make some sense. If you’re a Munin user and would like Gentoo to provide more and better plugins, then please take the time to see which of the plugins currently in the contrib tree are trying to reimplement something and failing at it (lots of them will be, I’m afraid, especially those related to APC UPSes), and get rid of them. There is also work to be done just to bring the documentation of the plugins up to speed with the format used by Munin proper, and this is without talking about improving them to follow the right code style or anything.
I also spent some time improving my IPMI plugin, and now it can monitor foreign hosts as well. (If you’re not a Gentoo user, you can now find it in the contrib repository, after I removed all the others that were trying to do the same thing, sometimes with twice as many lines of code as mine; if you are a Gentoo user, it takes the place of the original IPMI plugins shipped with Munin.) How is this useful? Well, among other things it lets you monitor Windows boxes and other boxes where you either lack access or can’t install any IPMI tool (I have a couple of systems running RHEL4 to monitor, okay?).
One interesting thing I learnt from this experience is that it makes total sense to monitor voltages, at least on HP servers. Besides the idea of monitoring for a PSU gone wrong, HP has one probe on the CMOS battery, a 3V CR2032 lithium cell whose voltage decreases over time, so it will show in the list when it has to be replaced. Unfortunately it also seems like their newest servers don’t have a probe there, which is bad (Excelsior has a VBAT which seems to be just the same thing).
That’s all for today!