Munin and lm_sensors

I’ve already posted some Munin notes before, when I had to fight with the hddtemp_smartctl plugin and with the bogus readings in my frontend’s sensors output. Today I’ll write a few more notes related to the sensors_ plugin, which tie heavily into lm_sensors territory.

Besides simply monitoring and graphing the input data, Munin can send notifications when values get too high or too low and might require direct action. When possible, the thresholds are provided by the plugin itself, which means that for lm_sensors the same min/max values reported by the sensors command will be used, something like this:

it8721-isa-0290
Adapter: ISA adapter
in0:          +3.06 V  (min =  +2.99 V, max =  +2.28 V)  ALARM
in1:          +3.06 V  (min =  +2.27 V, max =  +1.20 V)  ALARM
in2:          +1.13 V  (min =  +1.79 V, max =  +1.20 V)  ALARM
in3:          +2.94 V  (min =  +1.88 V, max =  +2.53 V)  ALARM
in4:          +2.71 V  (min =  +1.03 V, max =  +0.80 V)  ALARM
in5:          +2.86 V  (min =  +3.05 V, max =  +1.93 V)  ALARM
in6:          +1.46 V  (min =  +0.78 V, max =  +2.54 V)
3VSB:         +4.08 V  (min =  +4.20 V, max =  +2.66 V)  ALARM
Vbat:         +3.43 V  
fan2:        3901 RPM  (min =  121 RPM)
temp1:        +85.0°C  (low  = -61.0°C, high = +100.0°C)  sensor = thermal diode
temp2:        +79.0°C  (low  = +115.0°C, high = +123.0°C)  sensor = thermistor

As you can see from the output, the values are not always meaningful: the box (the same AT5IONT-I I referred to in the previous post) turns off long before reaching those temperatures. By the time temp2 reaches 90°C it is already too late, and a fan running at 121 RPM would mean that the box is gone for good.
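
To make the "not always meaningful" point concrete, here is a minimal Python sketch that scans sensors output for limits that cannot be trusted. It is not part of Munin or lm_sensors; the function name suspicious_limits and the two checks it applies (lower limit not below upper limit, current reading outside the limits) are my own assumptions about what counts as suspicious.

```python
import re

# Matches lines such as:
#   in0:    +3.06 V  (min =  +2.99 V, max =  +2.28 V)  ALARM
#   temp2:  +79.0°C  (low  = +115.0°C, high = +123.0°C)  sensor = thermistor
# Lines with only one limit (e.g. "fan2: 3901 RPM (min = 121 RPM)") do not match.
LINE_RE = re.compile(
    r"^(?P<label>[^:]+):\s*(?P<value>[+-]?\d+(?:\.\d+)?)"
    r".*?\((?:min|low)\s*=\s*(?P<low>[+-]?\d+(?:\.\d+)?)"
    r"[^,)]*,\s*(?:max|high)\s*=\s*(?P<high>[+-]?\d+(?:\.\d+)?)"
)


def suspicious_limits(sensors_output):
    """Return (label, reason) pairs for limits that look implausible.

    Hypothetical helper: the two checks below are assumptions, not
    anything Munin or lm_sensors does.
    """
    findings = []
    for line in sensors_output.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # chip header, adapter line, or sensor without both limits
        value = float(match.group("value"))
        low = float(match.group("low"))
        high = float(match.group("high"))
        label = match.group("label").strip()
        if low >= high:
            findings.append((label, "lower limit is not below upper limit"))
        elif not low <= value <= high:
            findings.append((label, "current reading is outside the limits"))
    return findings
```

Fed the output above, this would flag in0 (its minimum is higher than its maximum) and temp2 (a healthy reading that sits below its "low" limit), while leaving in6 and temp1 alone. It says nothing about whether a plausible-looking limit, such as temp1's 100°C, is actually useful for this board.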

I usually don’t care much about those values, as I check on the boxes’ health at least once a day, and lately a bit more often because I’m monitoring a new production server for a customer. My main home router has always reported issues with its fans, even though the fans themselves were running properly; I never cared much because I knew it was a fluke.

But when today, after the summer holidays, another customer’s backup server started showing temperature warnings, I was much more worried; after all, these are very hot days and I’m almost passing out from the heat myself. The warning turned out to be another fluke: before the holidays I updated the server to kernel 2.6.39, which has a driver for the sensor in that box. Unfortunately the sensors output reported 0°C as the critical temperature, and Munin assumed the reading was bad.

While there are one or two ways to work around this on the Munin side, I found it more robust to resolve the issue in the sensors themselves. You can do that by creating a /etc/sensors.d/local file with a few directives for the sensors modules to handle:

chip "nct6775-isa-0680"
     set temp1_max 80
     set temp1_max_hyst 75

But just creating this file is not enough to make it work, as you may have found if you tried it and it failed: you have to tell the kernel about the new limits, because sensors simply fetches the data it displays straight from the kernel. This is done through sensors -s, or through the lm_sensors init script if you have INITSENSORS=yes in /etc/conf.d/lm_sensors.

What was the problem with my router’s sensors, then? Well, the minimum value for one of the fans was set to over twenty thousand rotations per minute. It was just a matter of resetting it to 1500 (for a fan that averages around 2000 RPM it should be okay to warn at that point), and now it works quite nicely without warning me.
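
If you have more than a couple of boxes to fix, the stanzas can be generated rather than typed. The following is a minimal Python sketch that renders a chip stanza with set statements in the same layout as the file shown earlier; render_chip_stanza is a hypothetical helper of mine, and it only produces the text, so the limits still have to be written to the kernel afterwards.

```python
def render_chip_stanza(chip, limits):
    """Render a sensors.d stanza that sets limits for one chip.

    `chip` is the chip name as printed by sensors, and `limits` maps a
    feature name such as "temp1_max" to its new value. Hypothetical
    helper; it does not validate the feature names against the chip.
    """
    lines = ['chip "{}"'.format(chip)]
    for feature, value in limits.items():
        lines.append("     set {} {}".format(feature, value))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Reproduces the stanza used for the backup server.
    print(
        render_chip_stanza(
            "nct6775-isa-0680", {"temp1_max": 80, "temp1_max_hyst": 75}
        ),
        end="",
    )
```

For the router, the same call would take the router's chip name and something like {"fan1_min": 1500}; the feature name fan1_min is an assumption here, since which fan had the bogus minimum depends on the board.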

I’m now considering spending some more time to get the limits set to actually relevant values, so that I would be warned in case of a serious issue, but for now I think that just not having a visual cue for a fluke is enough.

Finally, I wish to thank Steve Schnepp both for implementing native IPv6 support in the upcoming Munin 2.0 version (which, together with the native SSH transport, should greatly simplify the current mess of aggregating data for the monitored hosts on my system), and for merging the three patches I sent, which will be part of the new release.

Munin no[dt]es

Back when I was looking at entropy I took Jeremy’s suggestion and started setting up Munin on my boxes to get an idea of what their average day looks like. This was immensely useful: it showed that the EntropyKey worked fine, but that the spikes in load and processes caused entropy depletion nonetheless. After a bit of time spent tweaking, trying, and working with Munin, I now have a few considerations I’d like to share.

First of all, you might remember my custom ModSecurity ruleset (which does not seem to pique the interest of very many people). One of the things that my ruleset filters is requests coming from user-agents that only report the underlying library version (such as libCurl, http-access2, libwww-perl, …), as most of the time they turn out to be custom scripts or bots. As it happens, even the latest sources for Munin do not report themselves as being part of anything “bigger”, with the final result that you can’t monitor Apache when my ModSecurity ruleset is present, d’oh!

Luckily, Christian helped me out and provided me with a patch (that is attached to bug #370131 which I’ll come back to later) that makes munin plugins show themselves as part of munin when sending the requests, which stops my ruleset from rejecting the request altogether.

In the bug you’ll also find a patch that changes the apc_nis plugin, used to request data from APC UPSes through apcupsd, to graph one more variable: the UPS internal temperature. This was probably ignored before because it is not present on most low-end series, such as the Back-UPS, but it is available in my SmartUPS, which last year did indeed shut down abruptly, most likely because of overheating.

Unfortunately the trouble is not always as easy to fix; indeed, one of the most obnoxious issues for me is that Munin does not support using IPv6 to connect to nodes. The problem lies in the Perl Net-Server module, and there is no known solution, barring some custom distribution patches, which I’d rather not ask to have implemented in Gentoo. For now I solved this by using ssh with port forwarding over address 127.0.0.2, which is not really nice, but works decently for my needs.

Another interesting area relates to sensors handling: Munin comes with a plugin to interpret the output of the sensors program from lm_sensors, which is tremendously nice for making sure that a system is not constantly overheating. Unfortunately this doesn’t always work as well as it should. It turns out that the output format expected by Munin has changed quite a bit in the latest versions: namely, the names of the min/max/hyst tuple of extra values vary depending on the chip in use. Plus, the plugin doesn’t take into consideration the possibility that one of the sensors is disabled altogether.

This last problem is what hit me on Raven (my frontend system). The sensors of the board (an Asus AT5IONT-I) were not supported on Linux up to version 2.6.38; I could only watch the values reported by coretemp, the Intel CPU temperature sensor. With version 2.6.39, the it8721 driver finally supported the board and I could watch all the temperature values. But of the three temperature values available from the sensor, only two are actually wired, respectively to a thermal diode and a thermistor; the third is disabled, but is still reported in the sensors output with a value of -128°C, which Munin then graphs. The only way I found to disable that was to create a local sensors.d configuration file to stop the value from appearing.
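
For illustration, this is roughly what handling the disabled sensor on the plugin side could look like, as opposed to the sensors.d fix I actually used. It is a minimal Python sketch, not Munin's code: drop_disabled is a hypothetical helper, and treating exactly -128°C as the "not wired" marker is an assumption based on what this one board reports.

```python
# Reading reported for the unwired third sensor on this board (assumed
# to be a fixed sentinel rather than a real measurement).
DISABLED_READING_C = -128.0


def drop_disabled(readings, sentinel=DISABLED_READING_C):
    """Return a copy of `readings` without sensors stuck at the sentinel.

    `readings` maps a sensor label to its temperature in degrees Celsius.
    """
    return {
        label: value for label, value in readings.items() if value != sentinel
    }


if __name__ == "__main__":
    # The label temp3 is an assumption; the other two values are examples.
    print(drop_disabled({"temp1": 85.0, "temp2": 79.0, "temp3": -128.0}))
```

Filtering in the configuration instead has the advantage that every consumer of the sensors output, not just Munin, stops seeing the bogus value.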

One interesting note relates to the network I/O graphing: Munin didn’t take the counter rollover at my router’s reboot very well, and reported a peak of petabits of I/O at the time. The problem turns out to be easy to solve, but not as easy to guess. When reading the ifconfig output, the if_ plugin does not discard any datapoint, even if it is bigger than the speed of any given interface, unless it can read the interface’s real speed through mii-tool. Unfortunately it can only do that if the plugin is executed as root, which obviously is not the default.
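
To see why a known interface speed matters, here is a minimal Python sketch of turning two counter samples into a rate. It is not the if_ plugin's code: the function name, the 64-bit counter width, and the 300-second interval are all assumptions chosen for the example. A counter that goes backwards is treated as having wrapped, which is exactly what turns a reboot into an absurd spike unless an upper bound is available to reject it.

```python
COUNTER_BITS = 64  # assumed counter width, for illustration only


def rate_bits_per_second(prev_bytes, curr_bytes, interval_s, max_bps=None):
    """Return the rate in bit/s, or None if the datapoint is discarded.

    A counter that went backwards is assumed to have wrapped around.
    Without `max_bps` (the interface speed) nothing can tell a genuine
    wrap apart from a counter reset caused by a reboot.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Modular difference: equals curr - prev normally, and the distance
    # "through" the wrap point when curr < prev.
    delta_bytes = (curr_bytes - prev_bytes) % (2 ** COUNTER_BITS)
    rate = delta_bytes * 8 / interval_s
    if max_bps is not None and rate > max_bps:
        return None  # faster than the link can go: discard
    return rate


if __name__ == "__main__":
    before_reboot, after_reboot = 5_000_000_000, 1_000_000
    # No known speed: the reset is read as a wrap, giving a huge rate.
    print(rate_bits_per_second(before_reboot, after_reboot, 300))
    # Known gigabit link: the same datapoint is rejected.
    print(rate_bits_per_second(before_reboot, after_reboot, 300, max_bps=1e9))
```

With the numbers above the unbounded result is in the hundreds of petabits per second, which is the kind of peak I saw on the graph.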

Another funny plugin is hddtemp_smartctl: rather than using the hddtemp command itself, it uses smartctl, which is part of smartmontools and which I most definitely keep around more often than the former. Unfortunately it also has limitations; the most obnoxious is that it doesn’t allow you to just list a bunch of device paths. Instead you have to first provide it with a set of device names, and then you can override their default device path; this is further complicated by the fact that each of the paths is forced to have /dev/ prefixed. Since Yamato has a number of devices, and they don’t always seem to get the same letters, I had to set it up this way:

[hddtemp_smartctl]
user root
group disk
env.drives sd1 sd2 sd3 sd4 sd5 sd6
env.dev_sd1 disk/by-id/ata-WDC_WD3202ABYS-01B7A0_WD-WCAT15838155
env.dev_sd2 disk/by-id/ata-WDC_WD3202ABYS-01B7A0_WD-WCAT16241483
env.dev_sd3 disk/by-id/ata-WDC_WD1002FAEX-00Z3A0_WD-WCATR4499656
env.dev_sd4 disk/by-id/ata-WDC_WD1002FAEX-00Z3A0_WD-WCATR4517732
env.dev_sd5 disk/by-id/ata-WDC_WD10EARS-00Z5B1_WD-WMAVU2720459
env.dev_sd6 disk/by-id/ata-WDC_WD10EARS-00Z5B1_WD-WMAVU2721970

For those curious, the temperatures vary between 39°C for sd6 and 60°C for sd3.
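
Since the names in env.drives are only placeholders, the whole section can be generated from a list of by-id entries. This is a minimal Python sketch under that assumption; render_hddtemp_config is a hypothetical helper, and the sdN naming simply mirrors what I used above.

```python
def render_hddtemp_config(disk_ids, user="root", group="disk"):
    """Render a [hddtemp_smartctl] section for the given by-id names.

    `disk_ids` are entries from /dev/disk/by-id, without any directory
    prefix; the plugin itself prepends /dev/ to each configured path.
    """
    names = ["sd{}".format(index) for index in range(1, len(disk_ids) + 1)]
    lines = [
        "[hddtemp_smartctl]",
        "user {}".format(user),
        "group {}".format(group),
        "env.drives {}".format(" ".join(names)),
    ]
    for name, disk_id in zip(names, disk_ids):
        lines.append("env.dev_{} disk/by-id/{}".format(name, disk_id))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(
        render_hddtemp_config(
            [
                "ata-WDC_WD3202ABYS-01B7A0_WD-WCAT15838155",
                "ata-WDC_WD3202ABYS-01B7A0_WD-WCAT16241483",
            ]
        ),
        end="",
    )
```

Note that the placeholder names follow the order of the list, so appending a disk at the end keeps the existing graphs attached to the same drives.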

I have to say I’m very glad I started using Munin, as it helped me understand a few important things. One is that I need to move my storage to a separate system, rather than delegating everything to Yamato (I’ll do so as soon as I have a real office and some extra cash). Another is that using a bridge to connect the virtualised guests to my home network is not a good idea: having Yamato’s network card in promiscuous mode means that it receives all the packets, even those directed to the access point connecting me to the router downstairs.