If you’re a Munin user in Gentoo and you look at ChangeLogs, you probably noticed that yesterday I committed quite a few changes to the latest ~arch ebuild of it. The main topic for these changes was async support, which unfortunately I think is still not ready yet, but let’s take a step back. Munin 2.0 brought one feature that was clamored for, and one that was simply extremely interesting: the former is the native SSH transport, the latter is what is called “Asynchronous Nodes”.
On a classic node whenever you’re running the update, you actually have to connect to each monitored node (real or virtual), get the list of plugins, get the config of each plugin (which is not cached by the node), and then get the data for said plugin. For things that are easy to get because they only require you to get data out of a file, this is okay, but when you have to actually contact services that take time to respond, it’s a huge pain in the neck. This gets even worse when SNMP is involved, because then you have to actually make multiple requests (for multiple values) both to get the configuration, and to get the values.
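To make the cost of that serialization concrete, here is a minimal sketch in Python. It is not Munin code: `FakeNode`, `FakePlugin`, the plugin names and all the timings are made up for illustration, and the node only adds up simulated seconds instead of talking to anything.

```python
from dataclasses import dataclass


@dataclass
class FakePlugin:
    """Hypothetical plugin with made-up response times, in seconds."""
    name: str
    config_seconds: float
    fetch_seconds: float


class FakeNode:
    """In-memory stand-in for a node; it only tracks simulated elapsed time."""

    def __init__(self, plugins):
        self.plugins = {p.name: p for p in plugins}
        self.elapsed = 0.0

    def list(self):
        return list(self.plugins)

    def config(self, name):
        # Not cached by the node, so the cost is paid on every update.
        self.elapsed += self.plugins[name].config_seconds
        return f"graph_title {name}"

    def fetch(self, name):
        self.elapsed += self.plugins[name].fetch_seconds
        return f"{name}.value 0"


def serialized_update(node):
    """Poll every plugin in turn: config first, then fetch, one at a time."""
    results = {}
    for name in node.list():
        config = node.config(name)
        values = node.fetch(name)
        results[name] = (config, values)
    return results


node = FakeNode([
    FakePlugin("cpu", 0.25, 0.25),      # reads a file: cheap
    FakePlugin("ipmi_temp", 4.0, 6.0),  # waits on a slow service
    FakePlugin("snmp_if", 3.0, 3.0),    # multiple SNMP requests
])
results = serialized_update(node)
print(f"simulated update time: {node.elapsed} seconds")
```

With these invented numbers the two slow plugins account for almost all of the update time, and the master has to sit through every second of it while the connection stays open.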
To the mix you have to add that the default timeout on the node, for various reasons, is 10 seconds which, as I wrote before, makes it impossible to use the original IPMI plugin for most of the servers available out there (my plugin instead seems to work just fine, thanks to FreeIPMI). You can increase the timeout, even though this is not really documented to begin with (unfortunately like most of the things about Munin), but that does not help in many cases.
So here’s how the Asynchronous node should solve this issue. On a standard node, as I said, the requests to the single node are serialized, so you’re actually waiting for each to complete before the next one is fetched. This can make the connection to the node take, all in all, a few minutes, and if the connection is severed during that time, you lose your data. The Asynchronous node, instead, has a different service polling the actual node on the same host and saving the data in its spool file. The master in this case connects via SSH (it could theoretically work using xinetd but neither Steve nor I care about that), launches the asynchronous client, and then requests all the data that was fetched since the last request.
This has two side effects. The first is that your foreign network connection is much faster (there is no waiting for the plugins to config and fetch the data), which in turn means that the overall munin-update transaction is faster. On top of that, if for whatever reason the connection fails at one point (a VPN connection crashes, a network cable is unplugged, …), the spooled data will cover the time that the network was unreachable as well, removing the “holes” in the monitoring that I’ve been seeing way too often lately. The second side effect is that you can actually spool data every five minutes, but only request it every, let’s say, 15, for hosts which do not require constant monitoring, even though you want to keep granularity.
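The spool idea can be sketched in a few lines of Python. This is a toy model and not how munin-async stores its data: the `Spool` class, its `spoolfetch` method and the sample values are assumptions for illustration, borrowing only the name of the `--spoolfetch` option.

```python
import bisect


class Spool:
    """Toy spool: samples kept in time order, queried by timestamp."""

    def __init__(self):
        self.timestamps = []
        self.values = []

    def append(self, timestamp, value):
        # The local poller appends samples in increasing time order.
        self.timestamps.append(timestamp)
        self.values.append(value)

    def spoolfetch(self, since):
        """Return every (timestamp, value) recorded strictly after `since`."""
        start = bisect.bisect_right(self.timestamps, since)
        return list(zip(self.timestamps[start:], self.values[start:]))


spool = Spool()
# Local poller: one sample every 5 minutes (300 seconds) for one hour.
for t in range(300, 3601, 300):
    spool.append(t, t // 300)

# A master that last asked at the 45-minute mark gets the three
# 5-minute samples recorded since then, so granularity is kept.
batch = spool.spoolfetch(2700)

# A master that was cut off after the 15-minute mark and only comes
# back at the end of the hour still gets everything it missed.
after_outage = spool.spoolfetch(900)

print(len(batch), len(after_outage))
```

The point of the model is that the master only has to remember the timestamp of its last successful request; how long it stays away does not change what it eventually receives, as long as the spool keeps the samples.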
Unfortunately, the async support is not as tested as it should be and there are quite a few things that are not ironed out yet, which is why the support for it in the ebuild has been this much in flux up to this point. Some things have been changed upstream as well: before, you had only one user, and that was used for both the SSH connections and for the plugins to fetch data. Unfortunately, one of the side effects of this is that you might have given your munin user more access (usually read-only, but oftentimes there’s no way to ensure that’s the case!) to devices, configurations or things like that… and you definitely don’t want to allow direct access to said user. Now we have two users, munin and munin-async, and the latter needs to have an actual shell.
I tried toying with the idea of using the munin-async client as a shell, but the problem is that there is no way to pass options to it that way, so you can’t use --spoolfetch, which makes it largely useless. On the other hand, I was able to get the SSH support a bit more reliable without having to handle configuration files on the Gentoo side (so that it works for other distributions as well; I need that because I have a few CentOS servers at this point), including the ability to use this without requiring netcat on the other side of the SSH connection (using one old trick with OpenSSH). But this is not yet ready; it’ll have to wait for a little longer.
Anyway, as usual, you can expect updates to the Munin page on the Gentoo Wiki when the new code is fully deployed. The big problem I’m having right now is making sure I don’t screw up the monitors at work while I’m playing with improving and fixing Munin itself.