What happened to my SSL certificates? A personal postmortem

I know that for most people this is not going to be very interesting, but my current job is teaching me that it’s always a good idea to help people learn from your own mistakes; especially so if you let others comment on said mistakes to see what you could have done better. So here it goes.

Let’s start by saying that I’m an idiot. Last month I was clever enough to update the certificate for xine-project, which was about to expire. Unfortunately, I wasn’t clever enough to notice that the rest of my certificates were going to expire at more or less the same time. Nor did I remember that my StartSSL verification was expiring too; last year I was in the US when that happened, and I had some trouble as my usual Italian phone number was unavailable. I actually got a notification that my certificate was expiring when I was in London last week. I promised myself to act on it as soon as I got home to Dublin, but of course I ended up forgetting about it.

And then this morning came, when I got notified via Twitter that my blog’s certificate expired. And then the panic. I’m not in Dublin; I’m not in Ireland, I’m not in Europe even. I’m in Washington, DC at LISA ‘13, without either my Italian or US phone number, without my client certificate, which was restricted to my Dell laptop which is sitting in my living room in Dublin, and of course, no longer living in Italy!

Thankfully, the StartSSL support staff are great guys, and while they couldn’t verify me for Class 2 right away, as I was before, I got at least far enough to be able to get new Class 1 certificates and start the process for Class 2 re-verification. Unfortunately, Class 1 means that I can’t have multiple hostnames in a certificate, or wildcard certificates. So I decided to bite the bullet and rely on SNI, which basically means that each vhost now has its own certificate. Which is fine, just a bit more convoluted to set up, as I had to create a number of Certificate Signing Requests (CSRs) myself: letting StartSSL generate the keys as 4096-bit RSA with SHA-256 takes a very long time.

Unfortunately, SNI means that there are a few people who won’t be able to access my blog any more, although most of them were already disallowed from commenting by my ModSecurity Ruleset, as they would be on Windows XP with Internet Explorer (any version lacks SNI there, while my ruleset would only stop IE6 from commenting). There are probably some issues for people stuck with Android 2 and the default browser. I’m sorry for you guys; I think Opera Mobile would work fine there, but feel free to scream at me if that’s the case.

Unfortunately, there seems to be trouble with Firefox and with Safari at this point: both these browsers enabled OCSP by default quite a while ago, but newly minted certificates from StartSSL will fail the OCSP check for a few hours. Also there seems to be an issue with Firefox on Android, where SNI is not supported, or maybe it’s just the same OCSP problem which leads to a different error message, I’m not sure. Chrome, Safari on iOS and Opera all work fine.

What still needs to be found out is whether Planet Gentoo and NewsBlur will handle this properly. I’m not sure yet, but I’ll find out pretty soon. Some offline RSS readers might also not support SNI; if that’s the case, rather than just complaining to me, let upstream know that they are broken. I’m sure somebody is going to have good fun with that.

Before somebody points out that I should have alerts about certificate expiration: yes, I know. I used to have these set up on the Icinga instance that was used by my previous employer, but I haven’t set up anything new for that since. I’m starting to do so as we speak, by building Icinga for my Puppetmaster host. I’m also going to put a reminder on my calendar to renew the certificates well before they expire, given the OCSP problem noted above.
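For anybody wanting something in the meantime, here is a minimal sketch of an expiration check, written in Python rather than as an Icinga check. The function names and the 30-day threshold are my own assumptions, not part of any existing tool. Passing server_hostname makes the client send SNI, so a server with one certificate per vhost answers with the right one.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter timestamp.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    for example "Nov 20 00:00:00 2013 GMT".
    """
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400.0


def fetch_not_after(hostname, port=443, timeout=10.0):
    """Fetch the notAfter field of the certificate served for `hostname`.

    server_hostname sends the name via SNI during the handshake.
    """
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]


def needs_renewal(not_after, warn_days=30, now=None):
    """True when fewer than `warn_days` days remain (threshold is an assumption)."""
    return days_until_expiry(not_after, now) < warn_days
```

Note that with the default context an already expired certificate makes the handshake itself fail with a verification error, which is an alert of its own kind.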

Questions and comments are definitely welcome, suggestions on how to make things better are too, and if you use Flattr remember to use your email address, as good suggestions will be rewarded!

Munin and IPv6

Okay, here comes another post about Munin for those who are using this awesome monitoring solution (okay, I think I’ve been involved in upstream development more than I expected when Jeremy pointed me at it). While the main topic of this post is going to be IPv6 support, I’d first like to spend a few words on the context of what’s going on.

Munin in Gentoo has been slightly patched in the 2.0 series — most of the patches were sent upstream the moment when they were introduced, and most of them have been merged in for the following release. Some of them though, including the one bringing my FreeIPMI plugin to replace the OpenIPMI plugins, or at least the first version of it, and those dealing with changes that wouldn’t have been kosher for other distributions (namely, Debian) at this point, were also not merged in the 2.0 branch upstream.

But now Steve opened a new branch for 2.0, which means that the development branch (Munin does not use the master branch, for the simple logistic reason of having a master/ directory in Git, I suppose) is directed toward the 2.1 series instead. This meant not only that I could finally push some of my recent plugin rewrites, but also that I could make some deeper changes, including rewriting the seven Asterisk plugins into a single one, and working hard on the HTTP-based plugins (for web servers and web services) so that they use a shared backend, like SNMP. This actually completely solved an issue that, in Gentoo, we had only partially solved before: my ModSecurity ruleset blacklists the default libwww-perl user agent, so with both the partial and the complete fix Munin advertises itself in the request; with the new code it also includes the plugin that is currently making the request, so that it’s possible to know which request belongs to which plugin.
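To illustrate the idea of tagging each request with the plugin making it, here is a small Python sketch. It is not the Perl code of the shared backend, and the agent string format, the function name and the default version are made up for the example; the real string Munin sends is not shown in this post.

```python
import urllib.request


def build_request(url, plugin_name, munin_version="2.1"):
    """Build an HTTP request that identifies both Munin and the plugin.

    The "munin/<version> (<plugin>)" format is a hypothetical one,
    chosen only to show the principle.
    """
    agent = f"munin/{munin_version} ({plugin_name})"
    return urllib.request.Request(url, headers={"User-Agent": agent})
```

With something like this in the access log, a rule that blocks the default library agent no longer hits the monitoring, and each request can be traced back to a plugin.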

Speaking of Asterisk, by the way, I have to thank Sysadminman for lending me a test server for working on said plugins — this not only got us the current new Asterisk plugin (7-in-1!) but also let me modify just a tad said seven plugins, so that instead of using Net::Telnet, I could just use IO::Socket::INET. This has been merged for 2.0, which in turn means that the next ebuild will have one less dependency, and one less USE flag — the asterisk flag for said ebuild only added the Net::Telnet dependency.

To the main topic: how did I get to IPv6 in Munin? Well, I was looking at which other plugins need to be converted to “modernity” – which to me means re-using as much code as possible, collapsing multiple plugins into one through multigraph, and supporting virtual nodes – and I found the squid plugins. This was interesting to me because I actually have one squid instance running, on the tinderbox host, to avoid direct connection to the network from the tinderboxes themselves. These plugins do not use libwww-perl like the other HTTP plugins, I suppose (but I can’t be sure, for reasons I’m going to explain in a moment) because the cache://objects request that has to be done might or might not work with the noted library. Since, as I said, I have a squid instance, and these (multiple) plugins look exactly like the kind of target that I was looking to rewrite, I started looking into them.

But once I started, I had a nasty surprise: my Squid instance only replies over IPv6, and that’s intended (the tinderboxes are only assigned IPv6 addresses, which makes it easier for me to access them, and have no NAT to the outside as I want to make sure that all network access is filtered through said proxy). Unfortunately, by default, libwww-perl does not support IPv6. And indeed, neither do most of the other plugins, including the Asterisk one I just rewrote, since they use IO::Socket::INET (instead of IO::Socket::INET6). A quick search around turned up this article, although then this one also turned up, which relates to IPv6 support in Perl core itself.

Unfortunately, even with the core itself supporting IPv6, libwww-perl seems to be of different ideas, and that is a showstopper for me I’m afraid. At least, I need to find a way to get libwww-perl to play nicely if I want to use it over IPv6 (yes I’m going to work this around for the moment and just write the new squid plugins against the IPv4). On the other hand, using IO::Socket::IP would probably solve the issue for the remaining parts of the node and that will for sure at least give us some better support. Even better, it might be possible to abstract and have a Munin::Plugin::Socket that will fall-back to whatever we need. As it is, right now it’s a big question mark of what we can do there.
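What a protocol-agnostic socket wrapper buys you can be sketched in a few lines. This is a Python analogy, not the Perl module discussed above, and connect_any is a hypothetical name: resolve the host for any address family, then try each result until one connects.

```python
import socket


def connect_any(host, port, timeout=5.0):
    """Connect over IPv6 or IPv4, whichever the resolver offers and works."""
    last_error = None
    # AF_UNSPEC asks for both AAAA and A results, in the resolver's order.
    for family, socktype, proto, _canon, sockaddr in socket.getaddrinfo(
            host, port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        sock = None
        try:
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(timeout)
            sock.connect(sockaddr)
            return sock
        except OSError as exc:
            # Remember the failure and fall back to the next address.
            last_error = exc
            if sock is not None:
                sock.close()
    raise last_error or OSError(f"no addresses found for {host!r}")
```

Python ships the same logic as socket.create_connection; the loop is spelled out here only to show the fallback a shared plugin helper would need.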

So what can be said about the current status of IPv6 support in Munin? Well, the Node uses Net::Server, and that in turn is not using IO::Socket::IP, but rather IO::Socket::INET or INET6 if installed — that basically means that the node itself will support IPv6 as long as INET6 is installed, and would call for using it as well, instead of using IO::Socket::IP — but the latter is the future and, for most people, will be part of the system anyway… The async support, in 2.0, will always use IPv4 to connect to the local node. This is not much of a problem, as Steve is working on merging the node and the async daemon in a single entity, which makes the most sense. Basically it means that in 2.1, all nodes will be spooled, instead of what we have right now.

The master, of course, also supports IPv6, via IO::Socket::INET6 (yet another nail in the coffin of IO::Socket::IP? Maybe). This covers all the communication between the two main components of Munin, and could be enough to declare it fully IPv6 compatible, and that’s what 2.0 is saying. But alas, this is not the case yet. On an interesting note, the fact that right now Munin supports arbitrary commands as transports, as long as they provide an I/O interface to the socket, makes the question of whether it supports IPv6 quite moot. Not only do you just need an IPv6-capable SSH to handle it, but you can probably use SCTP instead of TCP simply by using a hacked-up netcat! I’m not sure if monitoring would get any improvement from using SCTP, although I guess it might overcome some of the overhead related to establishing the connection, but... well, that’s a different story.
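The "arbitrary command as transport" idea boils down to talking to a child process over its stdin and stdout instead of a socket. A rough Python sketch of the principle follows; this is not how the master implements it, and both function names are hypothetical.

```python
import subprocess


def open_command_transport(argv):
    """Start `argv` and treat its stdin/stdout as the connection."""
    return subprocess.Popen(
        argv,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
        bufsize=1,  # line buffered, since the exchange is line based
    )


def exchange(proc, command):
    """Send one line to the transport and read back one line of reply."""
    proc.stdin.write(command + "\n")
    proc.stdin.flush()
    return proc.stdout.readline().rstrip("\n")
```

The command could be anything that ends up connected to the node, for instance something along the lines of ["ssh", "-6", "node.example.org", "nc", "localhost", "4949"], where the host name is a placeholder and 4949 is the usual node port. Whether the bytes travel over IPv4, IPv6 or something more exotic is then the command's business.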

Of course, Munin’s own framework is only half of what has to support IPv6 for it to be properly supported; the heart of Munin is the plugins, which means that if they don’t support IPv6, we’re dead in the water. Perl plugins, as noted above, have quite a few issues with finding the right combination of modules for supporting IPv6. Bash plugins, and indeed plugins in any other language that could be used, would support IPv6 as well as the underlying tools do; indeed, even though libwww-perl does not work with IPv6, plugins written with wget would work out of the box on an IPv6-capable wget… but of course, the gains we have by using Perl are major enough that you don’t want to go that route.

All in all, I think what’s going to happen is that as soon as I’m done with the weekend’s work (which is quite a bit since the Friday was filled with a couple of server failures, and me finding out that one of my backups was not working as intended) I’ll prepare a branch and see how much of IO::Socket::IP we can leverage, and whether wrapping around that would help us with the new plugins. So we’ll see where this is going to lead us, maybe 2.1 will really be 100% IPv6 compatible…

Munin, sensors and IPMI

In my previous post about Munin I said that I was still working on making sure that the async support would reach Gentoo in a way that actually worked. Now with version 2.0.7-r5 this is largely possible, and it’s documented on the Wiki for you all to use.

Unfortunately, while testing it, I found out that one of the boxes I’m monitoring, the office’s firewall, was going crazy if I used the async spooled node, reporting fan speeds way too low (87 RPMs) or way too high (300K), and with similar effects on the temperatures as well. This also seems to have caused the fans to go out of control and run constantly at their 4KRPM instead of their usual 2KRPM. The kernel log showed that there was something going wrong with the i2c access, which is what the sensors program uses.

I started looking into the sensors_ plugin that comes with Munin, which I already knew a bit as I had fixed it to match some of my systems before… and the problem is that for each box I was monitoring, it would have to execute sensors six times: twice for each graph (fan speed, temperature, voltages), once for config and once for fetching the data. And since there is no way to tell it to fetch just some of the data instead of all of it, it meant many transactions had to go over the i2c bus, all at the same time (when using munin async, the plugins are fetched in parallel). Understanding that the situation was next to unsolvable with the original code, and having one day “half off” at work, I decided to write a new plugin.

This time, instead of using the sensors program, I decided to just access /sys directly. This is quite a bit faster and lets you pinpoint what data you need to fetch. In particular, during the config step there is no reason to fetch the actual values, which saves many i2c transactions even just there. While at it, I also made it a multigraph plugin, instead of the old wildcard one, so that you only need to call it once, and it’ll prepare, serially, all the available graphs: in addition to those that were supported before, which included power – as it’s exposed by the CPUs on Excelsior – I added a few that I haven’t been able to try but are documented by the hwmon sysfs interface, namely current and humidity.
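The approach can be sketched as follows. This is a Python illustration and not the Perl plugin itself: the function names, the hwmon_temp graph name and the field naming are my assumptions, and it only handles temperatures. It also assumes the attributes sit directly under /sys/class/hwmon/hwmonN, which is not true on every kernel. The point to notice is that the config step only lists files, while the *_input files (and so the bus) are touched only when fetching.

```python
import glob
import os


def list_channels(hwmon_root="/sys/class/hwmon", kind="temp"):
    """Enumerate (chip_dir, channel) pairs without reading any sensor value."""
    channels = []
    for chip_dir in sorted(glob.glob(os.path.join(hwmon_root, "hwmon*"))):
        pattern = os.path.join(chip_dir, f"{kind}*_input")
        for path in sorted(glob.glob(pattern)):
            channel = os.path.basename(path)[:-len("_input")]
            channels.append((chip_dir, channel))
    return channels


def read_value(chip_dir, channel, scale=1000.0):
    """Read one channel; hwmon reports temperatures in millidegrees Celsius."""
    with open(os.path.join(chip_dir, f"{channel}_input")) as handle:
        return int(handle.read().strip()) / scale


def field_name(chip_dir, channel):
    """Munin field name made unique across chips (hypothetical scheme)."""
    return f"{os.path.basename(chip_dir)}_{channel}"


def config_lines(channels):
    """Output of the config step: no *_input file is opened here."""
    lines = ["multigraph hwmon_temp",
             "graph_title Temperatures",
             "graph_vlabel degrees Celsius"]
    for chip_dir, channel in channels:
        lines.append(f"{field_name(chip_dir, channel)}.label {channel}")
    return lines


def fetch_lines(channels):
    """Output of the fetch step: one read per channel, done serially."""
    lines = ["multigraph hwmon_temp"]
    for chip_dir, channel in channels:
        value = read_value(chip_dir, channel)
        lines.append(f"{field_name(chip_dir, channel)}.value {value:.1f}")
    return lines
```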

The new plugin is available on the contrib repository – which I haven’t found a decent way to package yet – as sensors/hwmon and is still written in Perl. It’s definitely faster, has fewer dependencies and it’s definitely more reliable, at least on my firewall. Unfortunately, there is one feature that is missing: sensors would sometimes report an explicit label for temperature data... but that’s entirely handled in userland. Since we’re reading the data straight from the kernel, most of those labels are lost. For drivers that do expose those labels, such as coretemp, they are used, though.

Also we lose the ability to ignore the values from the get-go, like I described before but you can’t always win. You’ll have to ignore the graph data from the master instead. Otherwise you might want to find a way to tell the kernel to not report that data. The same probably is true for the names, although unfortunately…

[temp*_label] Should only be created if the driver has hints about what this temperature channel is being used for, and user-space doesn’t. In all other cases, the label is provided by user-space.

But I wouldn’t be surprised if it was possible to change that a teensy bit. Also, while it does forfeit some of the labeling that the sensors program does, I was able to make it nicer when anonymous data is present: it wasn’t so rare to have more than one temp1 value, as it was the first temperature channel for each of the (multiple) controllers, such as the Super I/O, ACPI Thermal Zone, and video card. My plugin outputs the controller and the channel name, instead of just the channel name.
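The labeling rule described here could look roughly like this in Python. Again, this is a sketch and not the plugin's Perl code; the helper names are hypothetical. It uses the driver-provided label when the kernel exposes one, otherwise the bare channel name, and prefixes the controller's name in both cases.

```python
import os


def read_optional(path):
    """Return the stripped content of a sysfs file, or None if it is absent."""
    try:
        with open(path) as handle:
            return handle.read().strip()
    except OSError:
        return None


def channel_label(chip_dir, channel):
    """Label a channel as "<controller> <label or channel name>"."""
    # The hwmon "name" attribute identifies the controller (e.g. coretemp).
    chip = read_optional(os.path.join(chip_dir, "name"))
    if not chip:
        chip = os.path.basename(chip_dir)
    # <channel>_label exists only when the driver has a hint to offer.
    label = read_optional(os.path.join(chip_dir, f"{channel}_label"))
    return f"{chip} {label or channel}"
```

Two controllers that both expose a temp1 then end up with distinct labels on the graph.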

After I’ve completed and tested my hwmon plugin I moved on to re-rewrite the IPMI plugin. If you remember the saga I first rewrote the original ipmi_ wildcard plugin in freeipmi_, including support for the same wildcards as ipmisensor_, so that instead of using OpenIPMI (and gawk), it would use FreeIPMI (and awk). The reason was that FreeIPMI can cache SDR information automatically, whereas OpenIPMI does have support, but you have to tackle it manually. The new plugin was also designed to work for virtual nodes, akin to the various SNMP plugins, so that I could monitor some of the servers we have in production, where I can’t install Munin, or I can’t install FreeIPMI. I have replaced the original IPMI plugin, which I was never able to get working on any of my servers, with my version on Gentoo for Munin 2.0. I expect Munin 2.1 to come with the FreeIPMI-based plugin by default.

Unfortunately, like for the sensors_ plugin, my plugin was calling the command six times per host — although this allows you to filter for the type of sensors you want to receive data for. And that became even worse when you have to monitor foreign virtual nodes. How do I solve that? I decided to rewrite it to be multigraph as well… but shell script then was difficult to handle, which means that it’s now also written in Perl. The new freeipmi, non-wildcard, virtual node-capable plugin is available in the same repository and directory as hwmon. My network switch thanks me for that.

Of course unfortunately the async node still does not support multiple hosts, that’s something for later on. In the mean time though, it does spare me lots of grief and I’m happy I took the time working on these two plugins.

Asynchronous Munin

If you’re a Munin user in Gentoo and you look at ChangeLogs you probably noticed that yesterday I committed quite a few changes to the latest ~arch ebuild of it. The main topic for these changes was async support, which unfortunately I think is still not ready yet, but let’s take a step back. Munin 2.0 brought one feature that was clamored for, and one that was simply extremely interesting: the former is the native SSH transport, the latter is what is called “Asynchronous Nodes”.

On a classic node whenever you’re running the update, you actually have to connect to each monitored node (real or virtual), get the list of plugins, get the config of each plugin (which is not cached by the node), and then get the data for said plugin. For things that are easy to get because they only require you to get data out of a file, this is okay, but when you have to actually contact services that take time to respond, it’s a huge pain in the neck. This gets even worse when SNMP is involved, because then you have to actually make multiple requests (for multiple values) both to get the configuration, and to get the values.
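The serialized exchange can be sketched as a small client in Python. This is not the master's code and the function names are mine; it assumes the usual node conversation: a banner line, a one-line answer to list, and config/fetch answers that end with a line holding a single dot.

```python
import socket


def read_block(rfile):
    """Read reply lines until the lone '.' that ends config/fetch answers."""
    lines = []
    for raw in rfile:
        line = raw.rstrip("\n")
        if line == ".":
            break
        lines.append(line)
    return lines


def poll_node(sock):
    """Walk a node serially: list, then config and fetch for every plugin."""
    rfile = sock.makefile("r", encoding="utf-8", newline="\n")
    wfile = sock.makefile("w", encoding="utf-8", newline="\n")
    rfile.readline()  # banner line sent by the node on connection
    wfile.write("list\n")
    wfile.flush()
    plugins = rfile.readline().split()
    results = {}
    for plugin in plugins:
        # Each request blocks until the plugin has answered, so one slow
        # plugin delays everything queued behind it.
        wfile.write(f"config {plugin}\n")
        wfile.flush()
        config = read_block(rfile)
        wfile.write(f"fetch {plugin}\n")
        wfile.flush()
        values = read_block(rfile)
        results[plugin] = (config, values)
    wfile.write("quit\n")
    wfile.flush()
    return results
```

With N plugins that is one round trip for the list plus two per plugin, all paid for while the connection is open.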

To the mix you have to add that the default timeout on the node, for various reasons, is 10 seconds, which, as I wrote before, makes it impossible to use the original IPMI plugin for most of the servers available out there (my plugin instead seems to work just fine, thanks to FreeIPMI). You can increase the timeout, even though this is not really documented to begin with (unfortunately, like most things about Munin), but that does not help in many cases.

So here’s how the Asynchronous node should solve this issue. On a standard node, as I said, the requests to the single node are serialized, so you’re actually waiting for each to complete before the next one is fetched. This can make the connection to the node take, all in all, a few minutes, and if the connection is severed in the meantime, you lose your data. The Asynchronous node, instead, has a different service polling the actual node on the same host, and saving the data in its spool file. The master in this case connects via SSH (it could theoretically work using xinetd, but neither Steve nor I care about that), launches the asynchronous client, and then requests all the data that was fetched since the last request.

This has two side effects. The first is that your foreign network connection is much faster (there is no waiting for the plugins to config and fetch the data), which in turn means that the overall munin-update transaction is faster; but also, if for whatever reason the connection fails at one point (a VPN connection crashes, a network cable is unplugged, …), the spooled data will cover the time that the network was unreachable as well, removing the “holes” in the monitoring that I’ve been seeing way too often lately. The second side effect is that you can actually spool data every five minutes, but only request it every, let’s say, 15, for hosts which do not require constant monitoring, even though you want to keep granularity.
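The spooling idea, reduced to its bones, fits in a tiny in-memory Python class. This is only a model of the behaviour described here; the real daemon keeps its spool on disk in its own format, and the class and method names are hypothetical.

```python
import bisect


class Spool:
    """Samples recorded locally, handed out in bulk on request."""

    def __init__(self):
        self._timestamps = []  # kept sorted because record() is called in order
        self._samples = []

    def record(self, timestamp, plugin, value):
        """Called by the local poller, e.g. every five minutes."""
        self._timestamps.append(timestamp)
        self._samples.append((timestamp, plugin, value))

    def fetch_since(self, last_seen):
        """Called by the master: everything newer than its last fetch."""
        start = bisect.bisect_right(self._timestamps, last_seen)
        return self._samples[start:]
```

If the master polls every 15 minutes, or comes back after an outage, fetch_since still returns every five-minute sample recorded in between, so the graphs keep their granularity and show no holes.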

Unfortunately, the async support is not as tested as it should be and there are quite a few things that are not ironed out yet, which is why the support for it in the ebuild has been this much in flux up to this point. Some things have been changed upstream as well: before, you had only one user, and that was used both for the SSH connections and for the plugins to fetch data. Unfortunately, one of the side effects of this is that you might have given your munin user more access (usually read-only, but oftentimes there’s no way to ensure that’s the case!) to devices, configurations or things like that… and you definitely don’t want to allow direct access to said user. Now we have two users, munin and munin-async, and the latter needs to have an actual shell.

I tried toying with the idea of using the munin-async client as a shell, but the problem is that there is no way to pass options to it that way, so you can’t use --spoolfetch, which makes it largely useless. On the other hand, I was able to get the SSH support a bit more reliable without having to handle configuration files on the Gentoo side (so that it works for other distributions as well; I need that because I have a few CentOS servers at this point), including the ability to use this without requiring netcat on the other side of the SSH connection (using one old trick with OpenSSH). But this is not yet ready, it’ll have to wait for a little longer.

Anyway as usual you can expect updates to the Munin page on the Gentoo Wiki when the new code is fully deployed. The big problem I’m having right now is making sure I don’t screw up with the work’s monitors while I’m playing with improving and fixing Munin itself.

Munin again, sorry!

Okay this might start to be boring, but I’m still working on Munin, and that means you have to read (or not) another post on the topic.

So last time I was talking about Munin and I was intending to write about SNMP, but with one thing and another I ended up just writing about IPMI.

The only thing I want to make clear now about SNMP is that I’m still working on it. The main issue seems to be that the Munin plugins have a default timeout of 10 seconds, but the multigraph plugin that should be used for mapping multiple SNMP interfaces takes about 24 seconds for the switch I’d like to monitor here at my workplace. This should technically be solvable by using the new async daemon support, but this requires, in turn, support for the SSH transport, which can’t be used without relaxing the security applied to the munin user (by allowing it to log in). There is also another point I have to make: I intend to modify the plugin and make a second one that graphs all the interfaces in one single entry, which would give me more interesting data.

For what concerns IPMI instead: the new version of Munin 2 in Gentoo replaces both the IPMI plugins (ipmi_ and ipmi_sensor_) with my version based off FreeIPMI, which now not only outperforms the original ones (thanks to FreeIPMI caching, which is enabled by default), but also outfeatures the original plugins! Here’s the gist of it:

  • by using FreeIPMI, the plugin is shorter, much shorter, as the output is malleable to script handling;
  • I’ve made the script accept both the names used by ipmi_ and ipmi_sensor_ — both reported the same data, but one used bash and gawk (the GNU version only), while the other used python;
  • thanks to Kent I was able to get enough data to make it report power and current as well as temperatures and fans; I still have to implement voltage that was not implemented in the previous plugins either;
  • thanks to Albert, the new 1.2 series of FreeIPMI has support for threshold output — which was the remaining missing feature; I’ve implemented it in the plugin and I’ve patched 1.1.6 (and 1.1.7) to support the option as well;
  • and since I cared … the new version uses POSIX-compatible sh syntax and POSIX awk syntax instead of the GNU variants of both.

My next development is going to be supporting what they call “foreign hosts” on the plugin, so you can actually get Munin to monitor an IPMI SBMC instead of the local BMC interface. This will probably come soonish in Gentoo together with the support for asyncd.

What remains now is finding a way to package the contributed plugins, which need to be available, especially since Steve said he doesn’t want new plugins in the main package, and everything else has to come through the new contrib repository.

And yes, I still have to fix HTML generations, Justin, I know. I’m trying to find what the heck is wrong with it.

Munin, SNMP and IPMI

You might wonder why I’m working so much on Munin, given that I should be worried about other things such as my own job… well, it turns out that I’m using Munin at my job, and my bosses are so impressed with actually having a monitor on the resources (while they were originally sceptical that we needed it) that I can easily carve out more time on the payroll to work on improvements.

The funny thing is that we’re actually developing proprietary software based on Free Software components (don’t worry, we respect the licenses!) but our main FLOSS contributions are probably outside the area of development on which our business is based. I guess this is just the way it is; after all, the open source contributions of Facebook have little to do with social networking.

Anyway, while trying to set up monitoring for the servers and devices we care about, I was trying to solve the issue of knowing what the sensor readings are on the servers, which are mostly HP (with the exception of my Excelsior, which is monitored, yes, but on a different Munin anyway) and for which lm_sensors is useless: the data is not fed to the main system but rather to the IPMI management board.

Thankfully, there are a number of different tools that allow you to access that IPMI data, and Munin already had a plugin, ipmi_ that uses ipmitool to fetch the sensors’ data. The problem with it is that the fetching takes time, and if the plugin doesn’t reply in 10 seconds, Munin will consider it as not available. On the HP server I started setting this up, the reply time is well over 10 seconds.

To get around this limitation, I decided to take a different approach: I wrote my own plugin. Or rather I rewrote the original plugin using FreeIPMI which I know well (I maintain it in Gentoo and I sent patches upstream before). And this seems to be a win on all counts.

First of all, the ipmi-sensors command caches the so-called SDR data, which describe the sensors available on a system, which means that after the first execution, it doesn’t spend as much time parsing what it’s receiving. Then, since version 1 at least, it has a number of parameters that make it very easy to filter the output and receive it in a format that is suitable for script-based parsing. In particular you can filter what type of sensors you want data from (Temperature or Fan), ignore the values that are not available (e.g.: missing fans), all together with having the output in CSV form.
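To give an idea of why CSV output makes the parsing short, here is a Python sketch that reads such output and keeps only the usable temperature and fan readings. The plugin itself does this with awk, and the command can already do the filtering on its own; the SAMPLE text below is invented data, and its column layout (ID, Name, Type, Reading, Units, Event) is an assumption about what the tool prints, not a capture from a real server.

```python
import csv
import io

# Invented sample; the column layout is assumed, not captured from a server.
SAMPLE = """ID,Name,Type,Reading,Units,Event
1,Temp 1,Temperature,21.00,C,'OK'
2,Temp 2,Temperature,40.00,C,'OK'
3,Fan 1,Fan,N/A,RPM,N/A
4,Fan 2,Fan,5400.00,RPM,'OK'
"""


def parse_sensors(text, wanted_types=("Temperature", "Fan")):
    """Return (type, name, reading) tuples for available, wanted sensors."""
    readings = []
    for row in csv.DictReader(io.StringIO(text)):
        # Skip other sensor types and sensors with no reading (missing fans).
        if row["Type"] not in wanted_types or row["Reading"] == "N/A":
            continue
        readings.append((row["Type"], row["Name"], float(row["Reading"])))
    return readings
```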

The net result is that instead of a page-long gawk script to parse the lines, filter them, generate unique names and so on, I’m using a couple of awk commands. Yes, I tested them with mawk, but they also use a very, very simple syntax, so I’d expect them to keep working for a very long time. Basically now the plugin is very fast, very short, and very simple. And instead of just expecting to always have both temperature and fan data, this time it actually checks before suggesting anything at all.

Unfortunately, right now, it has two missing features that are present in the original plugin: the first is critical and warning thresholds that are not printed by the current version of FreeIPMI. This is okay, because likely the next release will have a switch to print those as well (which means that it’ll reach feature parity for those two sensors). The other issue is that the original plugin has an undocumented support for power metering.

Unfortunately none of my boards support power metering, so I’m stuck without that kind of data, and I can’t be sure how to implement it back right now: the ticket says that HP servers have the data, but none of the ones I have here report it. But they do report a few temperatures…

Munin graph of FreeIPMI-reported temperatures.

But with the exception of that, the plugin is much better than the one that was there before. Hopefully at some point it’ll be in the default Munin set instead of just being available for Gentoo users.

SITREP — Munin

You might be wondering where I disappeared given that I haven’t written for over a week. Knowing me it might have been bad, but luckily the situation is not that negative. I’m actually back in the States for a visit given I missed this year’s H1B lottery.

Now while I can’t figure out a permanent presence here, I’ve started working on a few tasks, including working with Munin to figure out the load on our current system. Thankfully, now that Munin is developed on Git, it’s dead easy to backport fixes, and to send new ones upstream. Indeed, the 2.0.2-r2 version that is in Gentoo is a little bit more stable and usable than the upstream-released one thanks to it. The one thing that I haven’t been able to work on yet as much as I want is supporting IPv6 nodes.

In particular, if you add a node to the Munin master using a hostname as address, and the hostname resolves as both A and AAAA, with version 2 it’ll try the IPv6 address and that will time out, because the node by default is only listening to IPv4 (on — for whatever reason, the default config has an open listener and authorisation for localhost only, which usually is not what you want). For node IPv6 support, you need the new Net::Server for which an ebuild is present but it’s not (yet) in tree.

Now, in this new version I wired in a good support for the Java-based plugin — which is basically just a way to connect and monitor remote JMX support. The problem with this plugin is that it’s designed to monitor Tomcat and only Tomcat — and is not really wildcarded: you can choose between one of many possible plugin loggers, but they do not let you choose a custom value — this makes it hard for me to use since I want to monitor some custom data out of a Java app that uses JMX by default. So I guess I’ll soon be spending time working in Java, whooooo.

Still talking about Munin, you will probably soon have a new revision bump with an added syslog USE flag that adds a Log::Syslog dependency as well as setting up the configuration files to use syslog — I really dislike having too many files for logs around, especially when metalog is there for that.

I guess this is all for what concerns Munin, for now.

Munin and lm_sensors

I’ve already posted about some munin notes before, when I had to fight with the hddtemp_smartctl plugin and with the bogus readings on my frontend’s sensors output. Today I’ll write a few more notes related to the sensors_ plugin, which heavily tie into lm_sensors territory.

Beside simply monitoring and graphing the input data, Munin provides support for notifying values that get too high, or too low, and that might require direct action. When possible, these values are provided by the plugin itself, which means, for what concerns lm_sensors that the same min/max values as reported by the sensors command will be used, something like this:

Adapter: ISA adapter
in0:          +3.06 V  (min =  +2.99 V, max =  +2.28 V)  ALARM
in1:          +3.06 V  (min =  +2.27 V, max =  +1.20 V)  ALARM
in2:          +1.13 V  (min =  +1.79 V, max =  +1.20 V)  ALARM
in3:          +2.94 V  (min =  +1.88 V, max =  +2.53 V)  ALARM
in4:          +2.71 V  (min =  +1.03 V, max =  +0.80 V)  ALARM
in5:          +2.86 V  (min =  +3.05 V, max =  +1.93 V)  ALARM
in6:          +1.46 V  (min =  +0.78 V, max =  +2.54 V)
3VSB:         +4.08 V  (min =  +4.20 V, max =  +2.66 V)  ALARM
Vbat:         +3.43 V  
fan2:        3901 RPM  (min =  121 RPM)
temp1:        +85.0°C  (low  = -61.0°C, high = +100.0°C)  sensor = thermal diode
temp2:        +79.0°C  (low  = +115.0°C, high = +123.0°C)  sensor = thermistor

As you can see from the output, the values are not always very meaningful: the box (which is the same AT5IONT-I I referred to in the previous post) turns off long before reaching those temperatures. As soon as temp2 reaches 90°C it is already too late, and a fan running at 121 RPM would mean that the box is gone for good.
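
One cheap sanity check on this kind of output is to look for readings whose lower limit is not actually below the upper one. The following is only a sketch in Python that parses lines shaped like the ones above; the regular expression and the function name are mine, not part of Munin or lm_sensors, and it says nothing about limits that are consistent but simply unrealistic (like that 121 RPM).

```python
import re

# Matches lines such as "in0: +3.06 V (min = +2.99 V, max = +2.28 V)".
NUMBER = r"[+-]?\d+(?:\.\d+)?"
LINE = re.compile(
    r"^(?P<name>[^:]+):\s+" + NUMBER + r"\s*\S+"
    r"\s+\((?:min|low)\s*=\s*(?P<low>" + NUMBER + r")[^,)]*"
    r"(?:,\s*(?:max|high)\s*=\s*(?P<high>" + NUMBER + r"))?"
)


def inverted_limits(sensors_output):
    """Return (name, low, high) for readings whose low limit is not below the high one."""
    found = []
    for line in sensors_output.splitlines():
        match = LINE.match(line.strip())
        if not match or match.group("high") is None:
            continue  # not a reading, or only one limit is reported
        low, high = float(match.group("low")), float(match.group("high"))
        if low >= high:
            found.append((match.group("name"), low, high))
    return found
```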

I usually don’t care much about those values, as I check on the boxes’ health at least once a day, lately a bit more often because I’m monitoring a new production server for a customer. And my main home router always reported issues with its fans, even though the fans themselves were running properly. I never cared much because I knew it was a fluke there.

But when today, after the summer holidays, another customer’s backup server started showing temperature warnings, I was much more worried; after all, these are very hot days and I’m almost going to pass out myself because of the heat. The warning turned out to be another fluke: before the holidays I updated the server to kernel 2.6.39, which has a driver for the sensor in that box. Unfortunately the sensors output reported 0°C as the critical temperature, and Munin assumed it was bad.

While there are one or two ways to work around this on the Munin side, I found it more solid to actually resolve the issue in the sensors themselves. You can do that by creating a /etc/sensors.d/local file, with a few directives for the sensors modules to handle:

chip "nct6775-isa-0680"
     set temp1_max 80
     set temp1_max_hyst 75

But just creating this file is not enough to make it work, in case you tried it and it failed: you have to tell the kernel about it, because sensors simply fetches the data it displays straight from the kernel. This is done through sensors -s or through the lm_sensors init script if you have INITSENSORS=yes in /etc/conf.d/lm_sensors.

What was the problem with my router’s sensors then? Well, the minimum value for one of the fans was set to over twenty thousand rotations per minute. It was just a matter of resetting it to 1500 (for a fan that averages around 2000 RPM it should be okay to warn at that point) and it now works quite nicely without warning me.

I’m actually now considering spending some more time to get the limits set to actually relevant values, so that I would be warned in case of a serious issue, but for now I think that just not having a visual clue for a fluke is enough.

Finally, I wish to thank Steve Schnepp both for implementing native IPv6 support in the upcoming Munin 2.0 version (which, together with the native SSH transport, should greatly simplify the current mess of aggregating data for the monitored hosts on my system), and for merging the three patches I sent him, which will be part of the new release.

Munin no[dt]es

Back when I was looking at entropy I took Jeremy’s suggestion and started setting up Munin on my boxes to get an idea of their day-to-day situation. This was actually immensely useful to note that the EntropyKey worked fine, but the spike in load and processes caused entropy depletion nonetheless. After a bit of time spent tweaking, trying, and working with Munin, I now have a few considerations I’d like to share with you.

First of all, you might remember my custom ModSecurity ruleset (which seems not to pique the interest of very many people). One of the things that my ruleset filters is requests coming from user-agents that only report the underlying library version (such as libCurl, http-access2, libwww-perl, …), as most of the time they seem to be custom scripts or bots. As it happens, even the latest sources for Munin do not report themselves as being part of anything “bigger”, with the final result that you can’t monitor Apache when my ModSecurity ruleset is present, d’oh!
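
To give an idea of what kind of check this is, here is a sketch of the logic in Python. It is not the actual ruleset (which is written for ModSecurity); the pattern, the function name and the example user-agent strings are all hypothetical.

```python
import re

# Hypothetical pattern: the whole user-agent is one "library/version" token.
LIBRARY_ONLY = re.compile(
    r"^(libcurl|libwww-perl|http-access2)[/ ][\w.]+$", re.IGNORECASE
)


def is_library_only(user_agent):
    """True when the user-agent names only an HTTP library and nothing else."""
    return bool(LIBRARY_ONLY.match(user_agent.strip()))


# A bare library identifier is rejected by this check...
print(is_library_only("libwww-perl/5.837"))
# ...while one that also names the application it is part of gets through.
print(is_library_only("munin/1.4 libwww-perl/5.837"))
```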

Luckily, Christian helped me out and provided me with a patch (that is attached to bug #370131 which I’ll come back to later) that makes munin plugins show themselves as part of munin when sending the requests, which stops my ruleset from rejecting the request altogether.

In the bug you’ll also find a patch that changes the apc_nis plugin, used to request data from APC UPSes through apcupsd, to graph one more variable, the UPS internal temperature. This was probably ignored before because it is not present on most low-end series, such as the Back-UPS, but it is available in my SmartUPS, which last year did indeed shut down abruptly, most likely because of overheating.

Unfortunately it’s not always as easy to fix the trouble; indeed one of the most obnoxious, for me, issues is that Munin does not support using IPv6 to connect to nodes. Unfortunately the problem lies in the Perl Net-Server module, and there is no known solution, barring some custom distribution patches — which I’d rather not ask to implement in Gentoo. For now I solved this by using ssh with port forwarding over address, which is not really nice, but works decently for my needs.

Another interesting area relates to sensors handling: Munin comes with a plugin to interpret the output of the sensors program from lm_sensors, which is tremendously nice to make sure that a system is not overheating constantly. Unfortunately this doesn’t always work as well as it should. It turns out that the output format expected by Munin has changed quite a bit in the latest versions; namely, the names of the min/max/hyst tuple of extra values vary depending on the chip used. Plus it doesn’t take into consideration the possibility that one of the sensors is disabled altogether.

This last problem is what hit me on Raven (my frontend system). The sensors of the board – Asus AT5IONT-I – were not supported on Linux up to version 2.6.38; I could only watch over the values reported by coretemp, the Intel CPU temperature sensor. With version 2.6.39, the it8721 driver finally supported the board and I could watch over all the temperature values. But of the three temperature values available from the sensor, only two are actually wired, respectively to a thermal diode and a thermistor; the third is disabled, but is still reported in the sensors output with a value of –128°C, which Munin then graphs. The only way I found to disable that was to create a local sensors.d configuration file to stop the value from appearing.

One interesting note relates to the network I/O graphing: Munin didn’t seem to take the counter rollover at the reboot of my router very nicely, as it reported a peak of petabits of I/O at the time. The problem turns out to be easy to solve, but not as easy to guess. When reading the ifconfig output, the if_ plugin does not discard any datapoint, even if it is bigger than the speed of any given interface, unless it can read the interface’s real speed through mii-tool. Unfortunately it can only do that if the plugin is executed as root, which obviously is not the default.
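
To see where a number like that can come from, here is a small sketch in Python of how a byte counter that went backwards turns into a rate. It is not the plugin's code: the function name is made up, and I'm assuming the counter is treated as a 64-bit one that wrapped around, which is one way such a peak can appear.

```python
def counter_rate(previous, current, interval, wrap=2**64, max_rate=None):
    """Turn two byte-counter readings into bits per second.

    Returns None when the result exceeds max_rate (the interface speed,
    if it is known); with max_rate=None nothing is ever discarded.
    """
    delta = current - previous
    if delta < 0:
        # The counter went backwards: assume it wrapped around.
        # After a reboot that assumption is wrong and the delta is huge.
        delta += wrap
    rate = delta * 8 / interval
    if max_rate is not None and rate > max_rate:
        return None
    return rate


# Normal five-minute sample: 300000 bytes transferred.
print(counter_rate(1_000, 301_000, 300))
# Counter reset by a reboot, interface speed unknown: an absurd peak.
print(counter_rate(4_000_000_000, 1_000, 300))
# Same sample, but with a known 1 Gbit/s limit the datapoint is dropped.
print(counter_rate(4_000_000_000, 1_000, 300, max_rate=1e9))
```

The last call is the case where the real speed is known, which is why running the plugin with enough privileges to read it makes the spike go away.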

Another funny plugin is hddtemp_smartctl: rather than using the hddtemp command itself, it uses smartctl, which is part of smartmontools and which I most definitely keep around more often than the former. Unfortunately it also has limitations; the most obnoxious is that it doesn’t allow you to just list a bunch of device paths. Instead you have to first provide it with a set of device names, and then you can override their default device path; this is further complicated by the fact that each of the paths is forced to have /dev/ prefixed. Since Yamato has a number of devices, and they don’t always seem to get the same letters, I had to set it up this way:

user root
group disk
env.drives sd1 sd2 sd3 sd4 sd5 sd6
env.dev_sd1 disk/by-id/ata-WDC_WD3202ABYS-01B7A0_WD-WCAT15838155
env.dev_sd2 disk/by-id/ata-WDC_WD3202ABYS-01B7A0_WD-WCAT16241483
env.dev_sd3 disk/by-id/ata-WDC_WD1002FAEX-00Z3A0_WD-WCATR4499656
env.dev_sd4 disk/by-id/ata-WDC_WD1002FAEX-00Z3A0_WD-WCATR4517732
env.dev_sd5 disk/by-id/ata-WDC_WD10EARS-00Z5B1_WD-WMAVU2720459
env.dev_sd6 disk/by-id/ata-WDC_WD10EARS-00Z5B1_WD-WMAVU2721970

For those curious, the temperatures vary between 39°C for sd6 and 60°C for sd3.
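
To make the name-to-path mapping described above a bit more explicit, this is how I understand it, written as a Python sketch. It models the behaviour as I described it, not the plugin's actual code; the function name and the example values are mine.

```python
def device_paths(env):
    """Map each name in env['drives'] to the device path that gets opened."""
    paths = {}
    for name in env.get("drives", "").split():
        # Without a dev_<name> override the drive name itself is used.
        relative = env.get("dev_" + name, name)
        # /dev/ is always prefixed, so overrides must be relative to it.
        paths[name] = "/dev/" + relative
    return paths


# Hypothetical configuration: one overridden name and one plain one.
example = {"drives": "sd1 sdb", "dev_sd1": "disk/by-id/ata-EXAMPLE"}
print(device_paths(example))
```

This is why the overrides in my configuration start at disk/by-id rather than at /dev.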

I have to say I’m very glad I started using Munin, as it helps me understand a few important things, among which is the fact that I need to move my storage to a separate system, rather than delegating everything to Yamato (I’ll do so as soon as I have a real office and some extra cash), and that using a bridge to connect the virtualised guests to my home network is not a good idea (having Yamato with the network card in promiscuous mode means that all the packets are received by it, even when they are directed to the access point connecting me to the router downstairs).

Monitoring a single server

If you follow my delicious you might have noticed some recently tagged content about Ruby and Gtk+. As you might guess, I’m going to resume working with Ruby and in particular I’m going to write a graphical application using Ruby-Gtk2.

The problem I’m trying to solve is related to the downtime I had: I cannot stay logged in over SSH with top open at any time of the day on my vserver to make sure everything is alright, and thus I ended up having some trouble because a script possibly went haywire (I’m not sure whether it went haywire before or after the move of the vserver to new hardware).

Since using Nagios is a bit of an overkill, considering I have to monitor a single box and I don’t want to keep looking at something (including my email), I’ve decided that the solution is writing a desktop application that will monitor the status of the box and notify me right away when something is not going as it should. Now of course this is a very nice target but a difficult one to achieve, starting with “how the heck do you get the data out of the box?”.

Luckily, for my High School final exam I presented a piece of software that was already a stab at the solution: ATMOSphere (yes, I know the site is lame and the project is well dead). It was written to monitor and configure my router, a D-Link DSL-500 (Generation I) that used ATMOS as its operating system (by GlobespanVirata; I still have the printed in-depth manuals for the complex CLI interface it had, both serial and telnet based). Together with the CLI protocol for setting up basic parameters, I used SNMP to read most parameters out of it. This is the reason why you might find my name related to a library called libksnmp: that library was a KDE-like interface to the net-snmp library (which was, at least at the time, a mess to develop with), which I used not only for ATMOSphere, but also for KNetStat to access remote interfaces (like the one of my router). Since then I haven’t worked with SNMP at all, although I’m sure my current router also supports it.

Despite being called the (Anything but) Simple Network Management Protocol, I’d expect SNMP to be used much more often for querying than for actually managing, especially considering the bad excuse for an authentication system that was included in the first two versions (not that the one included in version 3 is much better). The name is also almost certainly a misnomer, since the OID approach is probably one of the worst I’ve seen in my life for a protocol. But beside this, the software is very much available (net-snmp), and nowadays there is a decent client library in Ruby too, which makes it possible to write monitoring software relatively quickly.

My idea was to just write up something that sits in my desktop tray, querying the server for its status at a given interval. The nice thing here would be being able to notify me as soon as there’s a problem, both by putting a big red icon in my tray and by showing a message through libnotify to tell me what the problem is. This would allow me to know immediately if something went haywire. The problem is: how do you define “there’s a problem”? This is the part I’m trying to solve right now.

The SNMP specifications allow errors to be set, so you could just tell snmpd when to report that there’s an error; that way it would be the server, not the agent, that knows when to report problems. This is very nice, since you just need to configure it on the server, and even if you change workstation you’ll have the same parameters. Unfortunately this has limited scope: on most routers or SoHo network equipment you won’t find much configuration for SNMP. The D-Link ones, albeit supporting SNMP quite well, didn’t advertise it in the manual nor had configuration options on the web pages; the 3Com I have now has some configuration for SNMP traps and supports writing through SNMP (luckily, disabled by default). I guess I’ll have to add support for writing at least some parameters, so I could set up devices like these (those that support writing through SNMP to set up the alarms). But for those that lack writing support too, I suppose the only way would be to add some support for client-side rules that tell the agent when to issue a warning. I guess that might be a further extension.
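
To make the idea of client-side rules a bit more concrete, this is roughly the shape I have in mind, sketched in Python rather than in the Ruby the application would use. Nothing here talks SNMP: the readings are assumed to have been fetched already, and the rule format, the names and the thresholds are all invented for the example.

```python
import operator

OPERATORS = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}


def evaluate_rules(readings, rules):
    """Return one warning message for every rule whose condition holds."""
    warnings = []
    for name, op, threshold in rules:
        value = readings.get(name)
        if value is None:
            # A value we expected but could not read is a problem too.
            warnings.append(f"{name}: no data")
        elif OPERATORS[op](value, threshold):
            warnings.append(f"{name} is {value} ({op} {threshold})")
    return warnings


# Hypothetical readings and rules, kept on the workstation.
readings = {"load1": 7.5, "disk_free_mb": 2048}
rules = [
    ("load1", ">", 4.0),
    ("disk_free_mb", "<", 512),
    ("swap_used_mb", ">", 100),
]
for message in evaluate_rules(readings, rules):
    print(message)
```

The obvious downside is the one mentioned above: rules kept on the client have to follow me around whenever I change workstation.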

Right now I’m a bit at a stop because the version of Ruby-Gtk2 in Portage does not support GtkBuilder, which makes writing the interface quite a bit of an issue, but once the new version is in, I’ll certainly be working on something to apply there. In the meantime, I’m open to suggestions as to other monitoring applications that might save me from writing my own, or to ideas on how I could approach the problems that will present themselves. I think at least I’ll be adding some drop-down widget like the one for the world clock in Gnome (where the timezones are shown) with graphs of the interface in/out bandwidth use (which would be nice so I could resume monitoring my router too).

Okay, for now I suppose I’ll stop here. I’ll wait for the dependencies I need to be in Portage; maybe in the meantime someone will find me something better to do, or a piece of software that does what I’m looking for.