Munin and lm_sensors

I’ve already posted about some munin notes before, when I had to fight with the hddtemp_smartctl plugin and with the bogus readings on my frontend’s sensors output. Today I’ll write a few more notes related to the sensors_ plugin, which heavily tie into lm_sensors territory.

Beside simply monitoring and graphing the input data, Munin provides support for notifying values that get too high, or too low, and that might require direct action. When possible, these values are provided by the plugin itself, which means, for what concerns lm_sensors that the same min/max values as reported by the sensors command will be used, something like this:

Adapter: ISA adapter
in0:          +3.06 V  (min =  +2.99 V, max =  +2.28 V)  ALARM
in1:          +3.06 V  (min =  +2.27 V, max =  +1.20 V)  ALARM
in2:          +1.13 V  (min =  +1.79 V, max =  +1.20 V)  ALARM
in3:          +2.94 V  (min =  +1.88 V, max =  +2.53 V)  ALARM
in4:          +2.71 V  (min =  +1.03 V, max =  +0.80 V)  ALARM
in5:          +2.86 V  (min =  +3.05 V, max =  +1.93 V)  ALARM
in6:          +1.46 V  (min =  +0.78 V, max =  +2.54 V)
3VSB:         +4.08 V  (min =  +4.20 V, max =  +2.66 V)  ALARM
Vbat:         +3.43 V  
fan2:        3901 RPM  (min =  121 RPM)
temp1:        +85.0°C  (low  = -61.0°C, high = +100.0°C)  sensor = thermal diode
temp2:        +79.0°C  (low  = +115.0°C, high = +123.0°C)  sensor = thermistor

As you can see from the output, not always the values are very meaningful: the box (which is the same AT5IONT-I I referred to in the previous post) turns off much sooner than reaching those temperatures: as soon as temp2 reaches 90°C it is already too late, and having a fan run at 121 RPM would mean that the box is gone for good.

I usually don’t care much about those values as I keep an eye on the boxes’ health at least once a day, lately a bit more because I’m monitoring a new production server for a customer. And my main home router always reported issues with its fans, even though the fans themselves where running properly. I never cared much because I knew it was a fluke there.

But when today, after the summer holidays, another customer’s backup server started showing warnings in temperature, I was much more worried — after all, these are very hot days and I’m myself almost going to pass out because of it. The warning turned out to be another fluke: I updated the server to kernel 2.6.39 before holidays, which has a driver for the sensor in that box.. unfortunately the sensors output reported 0°C as the critical temperature, and munin assumed it was bad.

While there are one or two ways to work this around on Munin side, I found it more solid to actually resolve the issue on the sensors themselves. You can do that by creating a /etc/sensors.d/local file, with a few directives for the sensors modules to handle:

chip "nct6775-isa-0680"
     set temp1_max 80
     set temp1_max_hyst 75

But just creating this file is not enough to actually have it working, if you tried it and it failed: you have to tell the kernel about it, because the data sensors displays, it simply fetches straight from the kernel. This is done through sensors -s or through the lm_sensors init script if you have INITSENSORS=yes in /etc/conf.d/lm_sensors.

What was the problem with my router’s sensors then? Well, one of the fans’ minimum value was set to over twenty thousands rotations per minute.. it was just a matter of resetting it to 1500 (for a fan that averages on 2000 RPM it should be okay to warn at that point) and it works quite nicely without warning me.

I’m actually now considering to spend some more time to get the limits set to actually relevant values so that I would be warned in case of a serious issue, but for now I think that just not having a visual clue for a fluke would be enough.

Finally, I wish to thank Steve Schnepp for both implementing native IPv6 support in the upcoming Munin 2.0 version (which should simplify a lot the current mess of aggregating data for the monitored hosts on my system, together with the native SSH transport), and for merging the three patches I sent them, which will be part of the new release.