Munin, HP servers and APC powerstrips

Yes, I know I’m starting to get boring.

Today I spent at least half of my work day working on Munin plugins to effectively monitor some of the equipment we currently have at our co-location. This boils down to two metered APC power strips, or PDUs (let’s use their term, silly as it might sound). I think it’s worth noting the difference: APC provides switched and metered PDUs; the former should provide per-plug load data and let you power individual plugs on and off; the latter (what we have here) are much cheaper, do not allow you to turn plugs on and off, and simply give you a reading of the load per phase. Given that our co-location has only single-phase power, we only get one reading per strip, which is still okay: it gives us more information than we had before, at least.

Now, there are a few funny things with these strips: they have a network interface, which is cool, but they don’t use DHCP by default! You either have to set them up through the serial interface (which obviously is still a real serial port, not a USB adapter, and my laptop doesn’t have any serial port), or use a Windows program (which is actually written in Java and spends 98% of the install time copying an extra install of the JRE to the drive), or finally note down the MAC address when you install them, then poison a system’s ARP table to “fake” an IP for the strip, sealing the deal by sending a 113-byte ICMP packet to the strip via ping … no, there is no use for a watermelon or a chimp, sorry Luca.

After finally completing the IP settings, I had to find a way to get the data out. The strips support either SNMPv1 or SNMPv3; I discarded the former simply because it’s extremely insecure and I’d rather not even have that around, so I set up an SNMPv3 user for munin. Next problem? snmpwalk did not report any useful data. The reason is actually quite simple: it doesn’t know which OIDs to probe for. Download the MIB data from APC and install it on the system, and it’s much happier.

Then I had to write a plugin for it, which wasn’t too bad; the data is simple. Too bad I couldn’t find a way to get, through SNMP, the high limit of current drain on the strip; it did report the configured (default) limits for near-overload and overload, though, which makes it very nice to set them up in Munin. Unfortunately, only after writing the plugin did I find out that the Munin contrib repository already had not one but two plugins trying to do the same. Neither is very good at it, though: neither supported Munin’s SNMP framework, and one had a very unclear licensing situation (which is unfortunately common in the contrib repository) and used sh and net-snmp’s command-line utilities to access the strip.
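To make the idea of feeding the strip’s limits to Munin concrete, here is a minimal sketch in Python of what a plugin prints. It is not the plugin described above, and it does not use Munin’s SNMP framework; the strip name and all the numbers are placeholders, and a real plugin would read them from the strip over SNMP. It only shows the output shape: a `config` run that declares the graph and the warning/critical ranges, and a normal run that prints the current value.

```python
def munin_config(strip_name, near_overload_amps, overload_amps):
    """Lines a Munin plugin prints when invoked with the 'config' argument."""
    return [
        f"graph_title Load on {strip_name}",
        "graph_vlabel Amperes",
        "graph_category sensors",
        "load.label Load",
        # ":N" means "no lower bound, upper bound N" in Munin's range syntax.
        f"load.warning :{near_overload_amps}",
        f"load.critical :{overload_amps}",
    ]


def munin_values(load_amps):
    """Lines a Munin plugin prints on a normal (no-argument) run."""
    return [f"load.value {load_amps}"]


def main(argv):
    # Placeholder numbers: a real plugin would read them from the strip via SNMP.
    if len(argv) > 1 and argv[1] == "config":
        lines = munin_config("pdu1", near_overload_amps=12, overload_amps=16)
    else:
        lines = munin_values(3.2)
    print("\n".join(lines))


if __name__ == "__main__":
    import sys
    main(sys.argv)
```

Mapping near-overload to `warning` and overload to `critical` means Munin does the alerting with the thresholds already configured on the strip, rather than with a second set of numbers kept on the monitoring host.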

So after adding my plugin and removing the two bad ones, I also looked into cleaning up the contrib tree a little bit. It’s far from perfect: there are still miscategorized plugins and duplicates, and others (such as one of the net-p2p/transmission monitors) that rely on external script files instead of being written as a single file. But at least I was able to remove and recategorize enough of them that it starts to make some sense. If you’re a Munin user and would like Gentoo to provide more, better plugins, then please take the time to see which of the plugins currently in the contrib tree are trying to reimplement something and failing at it (lots of them will be, I’m afraid, especially those related to APC UPSes), and get rid of them. There is also work to be done just to bring the documentation of the plugins up to speed with the format used by Munin proper, and that is without even talking about improving them to follow the right code style or anything.

I also spent some time improving my IPMI plugin, and now it can monitor foreign hosts as well. (You can now find it in the contrib repository if you’re not a Gentoo user; if you are, it takes the place of the original IPMI plugins shipped with Munin. I removed all the others that were trying to do the same thing, sometimes with twice as many lines of code as mine.) How is this useful? Well, among other things it lets you monitor Windows boxes and other boxes where you either lack access or can’t install any IPMI tool (I have a couple of systems running RHEL4 to monitor, okay?).

One interesting thing I learnt from this experience is that it makes total sense to monitor voltages, at least on HP servers. Besides the idea of watching for a PSU gone wrong, HP has one probe set on the CMOS battery, a 3V CR2032 lithium battery whose voltage decreases over time, so the reading will show when it has to be replaced. Unfortunately it also seems like their newest servers don’t have a probe there, which is bad (Excelsior has a VBAT which seems to be just the same thing).

This is all for today!

Munin again, sorry!

Okay this might start to be boring, but I’m still working on Munin, and that means you have to read (or not) another post on the topic.

So last time I was talking about Munin and I was intending to write about SNMP, but with one thing and another I ended up just writing about IPMI.

The only thing I want to make clear now about SNMP is that I’m still working on it. The main issue seems to be that the Munin plugins have a default timeout of 10 seconds, but the multigraph plugin that should be used for mapping multiple SNMP interfaces takes about 24 seconds for the switch I’d like to monitor here at my workplace. This should technically be solvable by using the new async daemon support, but this requires, in turn, support for the SSH transport, which can’t be used without relaxing the security applied to the munin user (by allowing it to log in). There is also another point I have to make: I intend to modify the plugin and make a second one that graphs all the interfaces in one single entry, so that it gives me more interesting data.

As for IPMI: the new version of Munin 2 in Gentoo replaces both IPMI plugins (ipmi_ and ipmi_sensor_) with my version based on FreeIPMI, which now not only outperforms the original ones (thanks to FreeIPMI caching, which is enabled by default), but also has more features than they do! Here’s the gist of it:

  • by using FreeIPMI, the plugin is shorter, much shorter, as the output is malleable to script handling;
  • I’ve made the script accept both the names used by ipmi_ and ipmi_sensor_; both reported the same data, but one used bash and gawk (the GNU version only), while the other used Python;
  • thanks to Kent I was able to get enough data to make it report power and current as well as temperatures and fans; I still have to implement voltage, which was not implemented in the previous plugins either;
  • thanks to Albert, the new 1.2 series of FreeIPMI has support for threshold output, which was the remaining missing feature; I’ve implemented it in the plugin and I’ve patched 1.1.6 (and 1.1.7) to support the option as well;
  • and since I cared … the new version uses POSIX-compatible sh syntax and POSIX awk syntax instead of the GNU variants of both.
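To illustrate why script-friendly output keeps such a plugin short, here is a rough Python sketch that turns comma-separated sensor lines into Munin values. The plugin discussed above is written in sh and awk, so this is only an illustration. The sample text and its column order (ID, Name, Type, Reading, Units, Event) are assumptions made up for this example, not a captured log, and `parse_sensors` and `munin_field_name` are hypothetical helper names.

```python
import csv
import io

# Made-up sample; the column order (ID, Name, Type, Reading, Units, Event)
# is an assumption about comma-separated sensor output, not a captured log.
SAMPLE = """\
1,CPU Temp,Temperature,45.00,C,'OK'
2,Fan 1,Fan,5400.00,RPM,'OK'
3,PSU Power,Power Supply,N/A,N/A,'OK'
"""


def parse_sensors(text):
    """Return a list of dicts for every sensor that has a numeric reading."""
    sensors = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 5:
            continue  # blank or malformed line
        _sensor_id, name, kind, reading, units = row[:5]
        try:
            value = float(reading)
        except ValueError:
            continue  # readings such as "N/A" cannot be graphed
        sensors.append({"name": name, "type": kind, "value": value, "units": units})
    return sensors


def munin_field_name(sensor_name):
    """Reduce a sensor name to lowercase ASCII letters, digits and underscores."""
    cleaned = "".join(
        c if (c.isascii() and c.isalnum()) else "_" for c in sensor_name.lower()
    )
    # Field names should not start with a digit, so prefix those.
    if not cleaned or cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


if __name__ == "__main__":
    for sensor in parse_sensors(SAMPLE):
        print(f"{munin_field_name(sensor['name'])}.value {sensor['value']}")
```

Grouping the parsed sensors by their type column would be the natural next step, so that temperatures, fans, power and current each end up on their own graph.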

My next development is going to be supporting what they call “foreign hosts” on the plugin, so you can actually get Munin to monitor an IPMI SBMC instead of the local BMC interface. This will probably come soonish in Gentoo together with the support for asyncd.

What remains now is finding a way to package the contributed plugins that need to be available, especially since Steve said he doesn’t want new plugins in the main package, and everything else has to come through the new contrib repository.

And yes, I still have to fix HTML generation, Justin, I know. I’m trying to find what the heck is wrong with it.

Monitoring a single server

If you follow my delicious you might have noticed some recently tagged content about Ruby and Gtk+. As you might guess, I’m going to resume working with Ruby and in particular I’m going to write a graphical application using Ruby-Gtk2.

The problem I’m trying to solve is related to the downtime I had: I cannot stay logged in over SSH with top open at any time of the day on my vserver to make sure everything is alright, and thus I ended up having some trouble because a script possibly went haywire (I’m not sure whether it went haywire before or after the move of the vserver to new hardware).

Since using Nagios is a bit of an overkill, considering I have to monitor a single box and I don’t want to keep looking at something (including my email), I’ve decided that the solution is writing a desktop application that will monitor the status of the box and notify me right away when something is not going as it should. Now of course this is a very nice target but a difficult one to achieve, starting with “how the heck do you get the data out of the box?”.

Luckily, for my high school final exam I presented a piece of software that already went some way toward the solution: ATMOSphere (yes, I know the site is lame and the project is well dead), which was a program to monitor and configure my router, a D-Link DSL-500 (Generation I) that used ATMOS as its operating system (by GlobespanVirata; I still have the printed in-depth manuals for its complex CLI, available over both serial and telnet). Together with the CLI protocol for setting up basic parameters, I used SNMP to read most parameters out of it. This is the reason why you might find my name related to a library called libksnmp; that library was a KDE-like interface to the net-snmp library (which was, at least at the time, a mess to develop with), which I used not only for ATMOSphere, but also for KNetStat to access remote interfaces (like the one on my router). Since then I haven’t worked with SNMP at all, albeit I’m sure my current router also supports it.

Despite being called the (Anything but) Simple Network Management Protocol, I’d expect SNMP to be used much more often for querying than for actual management, especially considering the bad excuse for an authentication system that was included in the first two versions (not that the one included in version 3 is much better). The name is also almost certainly a misnomer, since the OID approach is probably one of the worst I’ve seen in my life for a protocol. But besides this, the software is readily available (net-snmp) and nowadays there is a decent client library in Ruby too, which makes it possible to write monitoring software relatively quickly.

My idea was to just write up something that sits in my desktop tray, querying the server for its status at a given interval. The nice thing here would be being able to notify me as soon as there’s a problem, both by putting a big red icon in my tray and by showing a message through libnotify to tell me what the problem is. This would allow me to know immediately if something went haywire. The problem is: how do you define “there’s a problem”? This is the part I’m trying to solve right now.

The SNMP specifications allow you to set error conditions, so you could just tell snmpd when to report an error; that way it is not the desktop agent but the server that knows when to report problems, which is very nice, since you only need to configure it on the server and even if you change workstation you’ll have the same parameters. Unfortunately this has limited scope: on most routers or SoHo network equipment you won’t find much configuration for SNMP. The D-Link ones, albeit supporting SNMP quite well, didn’t advertise it in the manual nor had configuration options on the web pages; the 3Com I have now has some configuration for SNMP traps and supports writing through SNMP (luckily, disabled by default). I guess I’ll have to add support for writing at least some parameters, so I can set up the alarms on devices like these that support writing through SNMP. But for those that also lack writing support, I suppose the only way would be to add some support for client-side rules that tell the agent when to issue a warning. I guess that might be a further extension.
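A client-side rule set of the kind mentioned above could be as small as the following Python sketch (the application itself is planned in Ruby; Python is used here purely for illustration). Everything in it is hypothetical: the `Rule` type, the reading keys `load_1min` and `disk_free_pct`, and the thresholds are invented for the example, and the readings are passed in as a plain dictionary instead of being fetched over SNMP.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    name: str                           # short label shown in the notification
    key: str                            # which reading the rule looks at
    is_problem: Callable[[float], bool]
    message: str


def evaluate(rules: List[Rule], readings: Dict[str, float]) -> List[str]:
    """Return one warning string per rule that fires or has no data."""
    warnings = []
    for rule in rules:
        value = readings.get(rule.key)
        if value is None:
            # A reading that never arrived is itself worth a warning.
            warnings.append(f"{rule.name}: no data for {rule.key}")
        elif rule.is_problem(value):
            warnings.append(f"{rule.name}: {rule.message} ({value})")
    return warnings


# Hypothetical rules; the keys and thresholds are made up for illustration.
EXAMPLE_RULES = [
    Rule("load", "load_1min", lambda v: v > 4.0, "load average too high"),
    Rule("disk", "disk_free_pct", lambda v: v < 10.0, "disk almost full"),
]

if __name__ == "__main__":
    for line in evaluate(EXAMPLE_RULES, {"load_1min": 9.0, "disk_free_pct": 40.0}):
        print(line)
```

An empty result would leave the tray icon alone, while any returned string is something to show through the notification. Treating a missing reading as a problem also covers the case where the box stops answering altogether.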

Right now I’m a bit stuck because the version of Ruby-Gtk2 in Portage does not support GtkBuilder, which makes writing the interface quite a bit of an issue, but once the new version is in, I’ll certainly be working on something to apply there. In the meantime, I’m open to suggestions about other monitoring applications that might save me from writing my own, or ideas on how I could approach the problems that will present themselves. I think at least I’ll be adding some drop-down widget like the one for the world clock in GNOME (where the timezones are shown) with graphs of the interface in/out bandwidth use (which would be nice so I could resume monitoring my router too).
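For the bandwidth graphs, the interface traffic figures that SNMP exposes (such as ifInOctets and ifOutOctets in the standard interfaces MIB) are running 32-bit totals rather than rates, so the graph has to be built from the difference between two polls. A minimal sketch in Python, assuming the counter wrapped at most once between polls and was not reset (for instance by a reboot); the function name is made up for this example:

```python
COUNTER32_MODULUS = 2 ** 32  # a 32-bit counter wraps back to 0 after 2**32 - 1


def octets_per_second(previous, current, interval_seconds,
                      modulus=COUNTER32_MODULUS):
    """Average transfer rate between two polls of a wrapping octet counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # The modulo makes the difference come out right across a single wrap.
    delta = (current - previous) % modulus
    return delta / interval_seconds


if __name__ == "__main__":
    # Two polls 30 seconds apart, with made-up counter values.
    print(octets_per_second(1000, 4000, 30))
```

On a fast link a 32-bit counter can wrap more than once between polls, so the polling interval has to be short enough for the single-wrap assumption to hold.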

Okay, for now I suppose I’ll stop here and wait for the dependencies I need to be in Portage; maybe in the meantime someone will find me something better to do, and some software that does what I’m looking for.

When the tools don’t work

So, last night after the meeting I wanted to relax a bit, and what’s more relaxing than writing some software you need in Ruby? And since I moved the DSL router to the other floor, what I need most now is a way to tell whether the DSL line is connected or not, without using the HTTP interface or going downstairs to look at the LEDs.

I then started writing the basic Korundum app, so that I could start creating a monitoring software using SNMP (and snmplib, that sounds way saner than Net-SNMP); of course the SCM I’ve been using was GIT, what else could I use? :) I also wanted to try the Version Control features of emacs with GIT, but this seemed not to work correctly :/

Basically, even if I tell emacs to commit and give it a log message, the commit never happens :( I’ll try to debug that later today; maybe I can come up with a fix that makes the support work, but first I need to understand a bit of Lisp, which is the hard part.

On a quasi-related note, related to the router, mostly, today I had to plug the second UPS in because there was a power line problem, and being disconnected while the rest of the network worked was a bit annoying ;)
Now I shouldn’t have any problem, and this is good especially because the weather doesn’t promise any sun in the next day at least.

Voodoo programming… when a debug info solves it all!

Ok, I found some time before getting my new job to work on KNetLoad, as a crash was found when adding KNetLoad to a vertical bar.
I also wanted to start working on the SNMP stuff, so I reworked the reading code once more. This time it seems flexible enough to be enhanced in the future.

I implemented the SNMP class, and then tried it… after some changes it worked… then I tried to put only the SNMP interface in the monitor… and it wasn’t working at all.

I messed around with everything: I wrote test cases for my library, which worked without a glitch; I tried the net-snmp console utilities, and they worked; then I thought the problem was with Qt’s event loop, and I changed the way the timer worked, but still nothing: it always timed out when I requested the data.

In the end, I put a debug call in the constructor of the class… and it started working right… I think the problem was the short delay between the library initialization and the session opening.

I hate Fridays… :)

Four in the morning

This is the hour and this is the post.
You are surely wondering what I’m doing awake at this hour; quite simply, I can’t get to sleep.

So what will I talk about in this entry? Well, nothing in particular; I just want it to be a good starting point for a new series of entries.

I no longer have the time to write entries about everything that happens to me, or about every interesting article I find on Slashdot or Punto Informatico. For the latter that’s no problem, since they usually speak for themselves; as for Slashdot, lately it leaves a lot to be desired: the news arrives days if not years late (in the case of OpenBGPD), the comments are ever more stupid and less nerdy, and I couldn’t stand the continuous articles about the American elections. I couldn’t have cared less!

The last /. story that interested me was the one about the ReactOS ripoff, although that too came with the usual stupid comments that remind me why I no longer frequent forums.

Work on Hypnos is also going slowly, mainly because I’m banging my head against the gcc library, which refuses to work properly, and because Chrono and Kheru are tired from university and work respectively.

Lately I’ve worked a bit more on KNetLoad and KVBA, porting the whole configuration part to the new KConfig XT framework, which is very nice and which incidentally inspired the autosettings script I wrote for Hypnos. But here too work is going slowly, because I don’t know whether there are already SNMP bindings for KDE or whether I have to recycle the work done for atmosphere in order to let KNetLoad receive its statistics via SNMP. Certainly, though, once this works KNetLoad will finally be able to establish itself at a level clearly above the current one, given that there are at least two competing projects: KNetInfo and KNemo.

On the other hand, what is lately shaping up for removal is XMMS, which with the latest releases gets slower, more CPU-intensive and less and less integrated. Even the simple fact that it uses GTK1 makes it rather hateful, since, after I disabled bitmap fonts in xorg, it refuses to draw fonts decently. Moreover, when I deal with songs that have Japanese titles in kanji, it refuses to show them.

Unfortunately, XMMS2 and Beep Media Player are still too experimental, and they don’t support all of XMMS’s input plugins, among them Musepack (MPC), in which I have a few tracks. The only decent replacement so far seems to be Kaffeine, which uses xine as its backend, but I’m not sure whether xine-lib supports Musepack; I should check. Unfortunately I know very little about audio/video codecs and I don’t think I can work on any modification to xine-lib that adds Musepack support.

Tomorrow I’ll also have to try to restore remote control support via LIRC, since the Gentoo ebuilds work on 2.6 kernels. The only thing is that I have to decide whether to try KDE’s interface (which replaces lircd) or not. I used to use lircd because I used the remote with xmms, zapping and Xine (with xine-ui), but if I switch to Kaffeine for audio too, since I no longer use zapping I could simply use KDE’s interface.

I have to admit that this entry is rather outside the lines I’ve imposed on the blog lately, but it was just a way to put on screen the thoughts I currently have to sort out; if anyone feels like helping me in the comments, they’re welcome.