Munin, HP servers and APC powerstrips

Yes, I know I’m starting to get boring.

Today I spent at least half of my work day working on Munin plugins to effectively monitor some of the equipment we currently have at our co-location. This boils down to two metered APC power strips, or PDUs (let’s use their term, silly as it might sound). I think it’s worth noting the difference: APC provides switched and metered PDUs. The former should provide per-plug load data and let you power each plug on and off; the latter (what we have here) are much cheaper, do not let you switch plugs on and off, and simply give you a reading of the load per phase. Given that our co-location has only single-phase power, we only get one reading per strip, which is still okay: at least it gives us more information than we had before.

Now, there are a few funny things with these strips: they have a network interface, which is cool, but they don’t use DHCP by default! You either have to set them up through the serial interface (which obviously is a real serial port, not a USB adapter, and my laptop doesn’t have any serial port), or use a piece of Windows software (which is actually written in Java and spends 98% of the install time copying an extra install of the JRE to the drive), or finally note down the MAC address when you install them, then poison a system’s ARP table to “fake” an IP for the strip, sealing the deal by sending a 113-byte ICMP packet to the strip via ping … no, there is no use for a watermelon or a chimp, sorry Luca.

After finally completing the IP settings, I had to find a way to get the data out. The strips support either SNMPv1 or SNMPv3; I discarded the former simply because it’s extremely insecure and I’d rather not even have that around, so I set up an SNMPv3 user for Munin. Next problem? snmpwalk did not report any useful data. The reason is actually quite simple: it doesn’t know which OIDs to probe for. Download the MIB data from APC and install it on the system, and it’s much happier.

Then I had to write a plugin for it, which wasn’t too bad since the data is simple. Too bad I couldn’t find a way to get, through SNMP, the upper limit of current drain on the strip; it did report the configured (default) limits for near-overload and overload, though, which makes it very nice to set them up as thresholds in Munin. Unfortunately, only after writing the plugin did I find out that the Munin contrib repository already had not one but two plugins trying to do the same. Neither is very good at it, though: neither supported Munin’s SNMP framework, and one had a very unclear licensing situation (which is unfortunately common in the contrib repository) and used sh and net-snmp’s command-line utilities to access the strip.
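To make the threshold idea concrete, here is a minimal sketch (not the actual plugin) of the text a Munin plugin prints. The field name `load`, the graph labels and the numbers in the usage note are made up for illustration; the real plugin would read the load and the two limits over SNMP instead of taking them as arguments.

```python
def config_lines(near_overload_amps, overload_amps):
    """Lines printed when Munin calls the plugin with the `config` argument.

    The near-overload limit becomes the warning threshold and the
    overload limit becomes the critical one.
    """
    return [
        "graph_title PDU load",        # hypothetical title
        "graph_vlabel A",
        "graph_category sensors",
        "load.label Load",
        f"load.warning {near_overload_amps}",
        f"load.critical {overload_amps}",
    ]


def fetch_lines(load_amps):
    """Lines printed when Munin calls the plugin without arguments."""
    return [f"load.value {load_amps}"]
```

With limits of, say, 12 A and 16 A, `config_lines(12, 16)` ends with `load.warning 12` and `load.critical 16`, so Munin raises the alerts at the same points the strip itself considers near-overload and overload.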

So after adding my plugin, and removing the two bad ones, I also looked into cleaning up the contrib tree a little bit. It’s far from perfect: there are still miscategorized plugins and duplicates, and others (such as one of the net-p2p/transmission monitors) which rely on external script files instead of being written as a single one. But at least I was able to remove and recategorize enough of them that it starts to make some sense. If you’re a Munin user and would like Gentoo to provide more, better plugins, then please take some time to see which of the plugins currently in the contrib tree are trying to reimplement something and failing at it (lots of them will be, I’m afraid, especially those related to APC UPSes), and get rid of them. There is also work to be done just to bring the documentation of the plugins up to speed with the format used by Munin proper, and this is without talking about improving them to follow the right code style or anything.

I also spent some time improving my IPMI plugin, which can now monitor foreign hosts as well. If you’re not a Gentoo user you can now find it in the contrib repository; if you are, it takes the place of the original IPMI plugins shipped with Munin. (I removed all the others in contrib that were trying to do the same thing, sometimes with twice as many lines of code as mine.) How is this useful? Well, among other things it lets you monitor Windows boxes and other boxes where you either lack access or can’t install any IPMI tool (I have a couple of systems running RHEL4 to monitor, okay?).

One interesting thing I learnt from this experience is that it makes total sense to monitor voltages, at least on HP servers. Besides the idea of watching for a PSU gone wrong, HP has one probe set on the CMOS battery, a 3V CR2032 lithium cell whose voltage decreases over time, so the readings will show when it has to be replaced. Unfortunately, it also seems that their newest servers don’t have a probe there, which is bad (Excelsior has a VBAT which seems to be just the same thing).

This is all for today!

Some UPS notes (apcupsd et al)

If you didn’t notice, one of the packages I’ve been maintaining in Gentoo is apcupsd, which is one of the daemons that can be used to control APC-produced UPS units. But for quite a while, my maintenance of the package was mostly limited to keeping it in a working state (with a wide range of different results, to be honest), since, compared with the messy ebuild I originally inherited, the current one is quite linear and, in my opinion, elegant.

But in the past two weeks or so, a few things happened that required me to look into the package more closely: a version bump, a customer having two UPSes connected to the same system, and the only remaining non-APC UPS in my home office declaring itself dead.

The version bump was the chance for me to finally fix the strict aliasing issue that was still present; this time, instead of simply disabling strict aliasing (the quick, hacky way), I decided to look at the code and make it actually strict-aliasing compliant. This might not sound like much, but this kind of warning is particularly nasty, as you never know when it will cause an issue. Besides, it caused Portage to abort in stricter mode, which is what I use for packages I maintain myself.

Also, while my customer’s needs didn’t really influence my work on apcupsd itself, they caused me to look even more into Munin’s apc_nis plugin, which beforehand was not configurable at all: it only ever used localhost:3551 to connect to the APC NIS interface, which meant that if you wanted to change the port, or make the daemon listen only on an external interface, you were out of luck. The patch to make this configurable is now part of Munin trunk, but I haven’t had time to ask Jeremy to add it to Gentoo as well. (The few patches of mine to Munin are all merged upstream now, and Munin 2 will have those and, finally, native IPv6 transport, which means I probably won’t need ssh connections to fetch data over NAT, just properly configured firewalls.)
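As an illustration of what “configurable” means here, the following is a rough Python sketch, not the actual patch. It assumes the plugin receives its settings as environment variables named `host` and `port` (hypothetical names), and it follows the NIS framing as I understand it: every message is prefixed by a two-byte big-endian length, and a zero length ends the reply.

```python
import os
import socket
import struct


def nis_endpoint(environ=os.environ):
    """Host and port to query, falling back to the old localhost:3551."""
    host = environ.get("host", "localhost")   # hypothetical variable name
    port = int(environ.get("port", "3551"))   # hypothetical variable name
    return host, port


def parse_status(lines):
    """Turn 'KEY : VALUE' status lines into a dictionary."""
    status = {}
    for line in lines:
        key, sep, value = line.partition(":")
        if sep:
            status[key.strip()] = value.strip()
    return status


def fetch_status(host, port, timeout=5.0):
    """Send the 'status' command and collect the reply lines."""
    command = b"status"
    lines = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(struct.pack("!H", len(command)) + command)
        stream = sock.makefile("rb")
        while True:
            header = stream.read(2)
            if len(header) < 2:
                break                       # connection closed early
            (length,) = struct.unpack("!H", header)
            if length == 0:
                break                       # zero length marks the end
            lines.append(stream.read(length).decode("ascii", "replace"))
    return lines
```

A plugin built this way would call `parse_status(fetch_status(*nis_endpoint()))` and pick out the fields it wants to graph, so pointing it at another box or port only takes a configuration change.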

There is another issue that comes up when you have multiple UPSes connected to the same box, though: persistence of device names. While the daemon auto-discovers a single connected APC device, when you have multiple devices you need to explicitly tell it which one to access. To do so, you could use the hiddev device paths, but the kernel does not keep those stable if you connect and disconnect the units. To solve this issue, the new ebuild for apcupsd that I committed today uses udev rules to provide /dev/apcups/usb-${SERIALNO} symlinks that you can use as stable references in the configuration of your apcupsd instances. I sent the rules upstream, hoping that they’ll be integrated in the next release.
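To show why the serial-based names help, here is a small sketch of how a script could look up the current device node for a given UPS. The function name is invented and the directory holding the usb-${SERIALNO} symlinks is passed in as a parameter; neither comes from apcupsd or the ebuild.

```python
import os


def resolve_ups_device(serial, links_dir):
    """Return the real device node behind the usb-<serial> symlink.

    The symlink name stays the same across reconnects, while the node
    it points to may change; None means the unit is not plugged in.
    """
    link = os.path.join(links_dir, f"usb-{serial}")
    if not os.path.exists(link):
        return None
    return os.path.realpath(link)
```

Each apcupsd instance can then be configured with the symlink itself, and a helper like this is only needed when you want to know which hiddev node a unit currently sits on.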

A note here: while I’m a fan of autoconfiguration, I’m having trouble with the idea of having apcupsd auto-started when an APC UPS is connected. The reason is not obvious, though: while it would work fine if it were the only UPS, and thus the only apcupsd instance present, if you had a second instance set up for a different UPS there would be no way to match the two together. This is at a minimum bothersome.

Speaking about init scripts, the powerfail init script currently only works in single-UPS configurations (whereas the main init script works fine in multiple-UPS configurations), and even there it is a bit … broken. The powerfail flag can be written in a number of different places (the upstream default and the Gentoo variant even point to different paths!), but this script does not take that into consideration at all. More to the point, the default, which uses /var/run, might not be available at the shutdown init level, since that would probably have been unmounted by that time. What I should do, I guess, is make it possible for the init script to fetch the configured value from the apcupsd configuration file, and move the default to use /run.
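The lookup I have in mind could be sketched like this, in Python rather than in the shell an init script would actually use. It assumes the flag directory is set with the PWRFAILDIR directive and that the flag file is called `powerfail`; the `/run` fallback is the proposed default, not what any released version does.

```python
import os

DEFAULT_PWRFAILDIR = "/run"  # proposed default, an assumption of this sketch


def powerfail_flag_path(config_text, default_dir=DEFAULT_PWRFAILDIR):
    """Find where the powerfail flag lives according to the configuration."""
    directory = default_dir
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and blanks
        if not line:
            continue
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] == "PWRFAILDIR":
            directory = parts[1].strip()      # last occurrence wins
    return os.path.join(directory, "powerfail")
```

With one configuration file per UPS, the same function would also let the powerfail script check the flag of each instance instead of a single hardcoded path.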

Next problem on my list is that apcaccess should not be among the superuser binaries, since it can be run as a regular user just fine. I’ll have to get that cleared with upstream first, though, since moving it in Gentoo only might break some scripts.

Finally, there is the problem that the sources of apcupsd are written with disregard for what many consider “library-only problems” (namely PIC) and have a very nasty copy-on-write scorecard. Unfortunately, some of the issues are so rooted in the design that I don’t feel up to fixing the sources myself, but if somebody wanted a project to follow and optimise, this might be a good choice in that respect.

Sigh. I hope to find more time to fix the remaining issues with the scripts soon. For now if you have comments/notes that I might have missed, your feedback is welcome.

The hardware curse

[Photo: DSCN2374]

Those reading my blog or following me on identi.ca might remember I had some problems with my APC SmartUPS, whose worn-out battery didn’t work properly. After the replacement I still had some trouble (which, to be honest, I’m not yet sure is sorted out), and in the end I settled on getting a new one to replace it, or to work side by side with it if that works out properly. This is why in the photo above you can see two UPSes and one box (Yamato), even though there are actually three here (just one other turned on right now, though).

I’m not sure what has caused it, but since a little before I actually started working as a self-employed developer and consultant, I’ve had huge hardware failures: two laptops in the very same week (my mother’s iBook and my MacBook Pro), the drum of the laser printer, the external iomega HD box (which was already a replacement for the same model that failed last November), and lastly the (software) failure of the PCI soundcard.

Around the same time I also ended up needing replacements for hardware that was now sub-optimal for my use. The Fast Ethernet switch was replaced by a Gigabit one, because now Merrimac (my iMac) is always turned on and makes heavy use of networking, especially with iSCSI. The hard disks, which were themselves replaced just last March and are helped out by one of the two disks in the external box, started being too small, which is why I got an extra one, a FreeAgent Xtreme running on eSATA, for one more terabyte of storage. My cordless phone required me to get another handset so that my mom’s calls wouldn’t bother me. My cellphone (just recently) is being phased out for a Nokia E75 so I could get a Business account (it was a Nokia E71 before). I got an HP all-in-one so that I had an ADF scanner to pair with the negative film scanner for archival purposes, plus some smaller things to go with that. I should also update, again, the router: after three years of good service, my 3Com is starting to show its age and also to hit limitations, including the very artificial limit of 32 devices in the WLAN MAC-address list (and the fact that it doesn’t support IPv6).

Then there have been costs much more tied to work (not that the stuff I have mentioned isn’t part of my job anyway), like proprietary software licenses (Parallels Desktop, Visual Studio 2008, and soon Microsoft Office, Avira and Mac OS X Snow Leopard) and the smartcard reader. And of course there is the rent for the new vserver (vanguard) and phone bills to pay. Given the amount of work my boxes do during the day, I’ll soon switch the power contract over to me rather than my family and pay for that too, unless I decide to move the office into a real office (possibly one I can stay in at any hour), and just keep one terminal at home or something like that (but then, what would I keep?). Oh, and obviously there are a few more things like business cards and similar.

Now, all these are work expenses, so they are important up to a point; I actually get paid well enough to cover them at the moment, even though I now have a quite funky wishlist which includes both leisure-related and work-related items (you’re free to help me with both, j/k). The problem is that I would have been much better off if I didn’t have all this mess, especially considering that, as I have said before, I really wish I could move out of home soon.

But anyway, this is still work time!

Tinderbox suspension for a few days

If you’re following me on Bugzilla, looking at the feed of newly reported bugs, or just watching the trackers for the bugs I usually report most often (mostly gcc 4.4 and glibc 2.10 failures), you might have noticed that I haven’t been opening new bugs for a few days already. Please note that this is not a permanent situation; it’s just a temporary setback, caused by (you guessed it) a hardware failure (you could have answered this more quickly if you were following me on identi.ca, since I have been writing about it).

This time the problem is the UPS’s battery, which has worn out after two and a half years (which actually explains pretty well how it was that this year I won two APC gadgets, after years of filling in questionnaires). As soon as the load reaches 40%, the battery charge turns out to last for 1 to 3 minutes at most, which is not right and certainly not enough to shut Yamato down in the middle of a tinderboxing run. I also have to note that the power load on the UPS varies between 27% when almost idle and 60% during a full-blown build on all the cores, which is what the tinderbox does.

I called my supplier on Wednesday night to order the battery; it’ll arrive at the shop next Monday, but I’ll only be able to pick it up on Tuesday (since I have a work appointment and I’ll be around that place). In the meantime, I’m trying to keep the load on the UPS to a minimum so that it has at least a few minutes to shut everything down; this obviously includes not having the tinderbox running full time.

As for me, to avoid loading the box with my own usage, I’m trying to take a few days to work on my paid job (which requires me to use Merrimac instead; that one is on a different UPS which, as far as I can tell, is still keeping up for now), to read, and to watch a few films I’ve gotten lately and haven’t had time to watch. Maybe I’ll be able to play a bit too, but I’m not counting on it much: I already have a full weekend of job tasks.

In the past ten days I was able to read John Grisham’s The Rainmaker from cover to cover. I don’t particularly like lawyers; I don’t particularly dislike them either. I don’t know why I tend to devour Grisham’s books this way (the first I read, The King of Torts, in Italian, I finished in a single night!). At any rate, I’m trying to get back to my average reading pace: in 2007 I read 19 books, last year just seven. Of course the reason why I read so many more in 2007 is tied to the 42 days in hospital I went through and the months it took me to recover, but I’d like to reach at least 13 books this year. It’s also a sort of way to try cooling off before I burn out.

Anyway, will be back soon.

Enel: what do you want to break today?

I already wrote some time ago about my service problems with Enel, but that was last summer. After those blackouts there have been others, sometimes due to bad weather (so entirely understandable), other times with no explanation. At the beginning of the year there was an evening with two dips (momentary voltage drops), with no thunderstorm or anything else that could explain them.

But fine, one evening can happen. The problem is that it has just happened again. And there is no thunderstorm. And there is no notice. And this time there were four dips.

Somebody has to explain to me why it is always this area, for the twenty years we have lived here. Go two streets away and they have no blackouts short of a tornado; go to a neighbouring town and they don’t even know what a blackout lasting more than half an hour is, while here we get five-hour ones, at least once a year.

Enel, I am not very happy with the service, not at all.

Now, the TV, PlayStation 3 and AppleTV are not on any UPS; I think it’s a good idea to get a new one, and who knows, maybe APC will take the old Mustek as a trade-in.

Update, 2am: even though the nice little man who answered my mother at the fault reporting office this afternoon had assured her that the problem had been fixed, the truth was very different. In fact the fault, a big one, a very big one, had not been identified this afternoon.

And right on cue at 9pm, more dips, and a blackout. 803500 reports that the problem will be fixed “in ten minutes”; twenty minutes later it reports instead that power will presumably return by 22:15; at 22:30 the message had been removed and the operator tells me that, unfortunately, the fault was bigger than expected and we will actually have to wait until midnight. At a quarter past midnight the problem is obviously not fixed, but they assure me that within half an hour at most the generator will start up.

At ten past one, with power still out, they explain what kind of fault it was: a medium-voltage cable (twenty thousand volts) had an insulation problem and a pole was discharging to ground, causing the first dips. It seems, though, that this caused a flare-up at the nearby transformer substation, which in turn took out three distribution substations. Since re-laying the cable at one in the morning, in the rain, was a bit difficult, they decided to go the expensive route of calling in generators. Two were not enough, though, and obviously which area is served by the third, which was called in from Treviso? Mine.

Tomorrow I will start looking into a proper general protest against Enel, since it seems impossible to me that a cable like that develops insulation problems overnight. More likely the insulation problem is area-wide and somebody has been snoozing on the inspections. In any case, it is not nice.

Downtime and new UPS

Sorry for the downtime, guys: the update to baselayout 2 on Farragut wasn’t as easy as expected. I need to have a talk with Roy about it ;)

Now, this is probably going to be the fourth night I’m not going to sleep well. Today I had a support request to take care of in the afternoon and I was supposed to go out with a few friends tonight, but I was so tired that I started running a fever, so I had to take a rain check on that.

The good news is that I received the new UPS, an APC SmartUPS 1000VA, which seems to be able to keep enterprise and farragut up for an hour and a quarter, which is quite good. The other UPS would keep the monitors and the network equipment up for about an hour as well, which is not bad and should cover most short unplanned outages.

Unfortunately, if you have followed me for a while, you know there’s no flawless hardware acquisition for me; this time the problem comes from the software needed to control the two UPSes; please note that there are two.

First of all, when I started using a UPS I used apcupsd, but then I moved to nut because it had a decent graphical control program (knutclient). What was the problem with this? Well, nut was confused by the two UPSes, which, both being APC units, share the identical vendor and product ID for the USB device. So it was not a nice thing.

I then decided to go back to apcupsd, but the latest version is not in Portage (again), so I bumped it locally with the patch on Bugzilla, and added a gnome USE flag for the graphical control utility that is now present. This version is important to me not only because of the utility, but also because with it it is possible to choose the configuration file at runtime, as it’s no longer hardcoded at build time.

This is also important because in previous versions the only way to have more than one UPS monitored on the same box with apcupsd was to have two different builds in different places. What I want to do now is rewrite the init script so that it is multiplexed (like rbot, mt-daapd, openvpn and so on); that way I can simply start two apcupsd instances to have all I need. (Note that one of them won’t be shutting off my box at all, as it would just keep monitoring the UPS that is actually connected to the monitors and networking.)
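The multiplexing idea boils down to deriving the instance, and therefore the configuration file, from the name the init script is invoked under. Here is a sketch of that mapping in Python; the `apcupsd.<instance>` naming and the `/etc/apcupsd/<instance>.conf` layout are assumptions for illustration, not what the final init script necessarily does.

```python
def instance_config(service_name, config_dir="/etc/apcupsd"):
    """Map an init script name to the configuration file of its instance.

    'apcupsd' alone keeps using the default apcupsd.conf, while a name
    such as 'apcupsd.monitors' selects monitors.conf.
    """
    _base, _sep, instance = service_name.partition(".")
    if not instance:
        instance = "apcupsd"
    return f"{config_dir}/{instance}.conf"
```

So a symlink called `apcupsd.monitors` would start a second daemon reading `/etc/apcupsd/monitors.conf`, which is only possible now that the configuration file can be chosen at runtime.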

Once Baselayout 2 is more functional on Farragut, I’ll probably move the apcupsd instances there and network them down here; it would be safer in the long run, as enterprise might not be up when farragut is.

I still haven’t been able to finish the rbot changes I need, sigh. And don’t get me started on xine-lib: I’m taking a few days off because what I found myself working on now is a veeery bad thing, I’m afraid… (bits-per-sample transcoding during output).