Munin and lm_sensors

I’ve posted some Munin notes before, when I had to fight with the hddtemp_smartctl plugin and with the bogus readings in my frontend’s sensors output. Today I’ll add a few more notes on the sensors_ plugin, which tie heavily into lm_sensors territory.

Besides simply monitoring and graphing the input data, Munin can send notifications when values get too high or too low and might require direct action. When possible, the thresholds are provided by the plugin itself, which, as far as lm_sensors is concerned, means that the same min/max values reported by the sensors command will be used, something like this:

it8721-isa-0290
Adapter: ISA adapter
in0:          +3.06 V  (min =  +2.99 V, max =  +2.28 V)  ALARM
in1:          +3.06 V  (min =  +2.27 V, max =  +1.20 V)  ALARM
in2:          +1.13 V  (min =  +1.79 V, max =  +1.20 V)  ALARM
in3:          +2.94 V  (min =  +1.88 V, max =  +2.53 V)  ALARM
in4:          +2.71 V  (min =  +1.03 V, max =  +0.80 V)  ALARM
in5:          +2.86 V  (min =  +3.05 V, max =  +1.93 V)  ALARM
in6:          +1.46 V  (min =  +0.78 V, max =  +2.54 V)
3VSB:         +4.08 V  (min =  +4.20 V, max =  +2.66 V)  ALARM
Vbat:         +3.43 V  
fan2:        3901 RPM  (min =  121 RPM)
temp1:        +85.0°C  (low  = -61.0°C, high = +100.0°C)  sensor = thermal diode
temp2:        +79.0°C  (low  = +115.0°C, high = +123.0°C)  sensor = thermistor

As you can see from the output, the values are not always very meaningful: the box (the same AT5IONT-I I referred to in the previous post) turns off well before reaching those temperatures. By the time temp2 reaches 90°C it is already too late, and a fan running at 121 RPM would mean that the box is gone for good.
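One quick way to spot limits like these is to look for readings whose lower limit is not below the upper one. What follows is only a minimal sketch: the function name suspicious_limits and the regular expression are my own assumptions about the text layout shown above, not anything that lm_sensors or Munin provide, and it only catches inverted limits, not limits that are merely unrealistic.

```python
import re

# Matches lines such as
#   "in0:  +3.06 V  (min =  +2.99 V, max =  +2.28 V)  ALARM"
# Lines with a single limit (e.g. a fan with only "min") do not match.
LIMITS = re.compile(
    r"^(?P<label>[^:]+):\s+(?P<value>[+-]?\d+(?:\.\d+)?).*?"
    r"\((?:min|low)\s*=\s*(?P<low>[+-]?\d+(?:\.\d+)?)[^,]*,\s*"
    r"(?:max|high)\s*=\s*(?P<high>[+-]?\d+(?:\.\d+)?)"
)


def suspicious_limits(sensors_output):
    """Return (label, low, high) for readings whose lower limit is not below the upper one."""
    flagged = []
    for line in sensors_output.splitlines():
        match = LIMITS.match(line.strip())
        if match is None:
            continue
        low, high = float(match["low"]), float(match["high"])
        if low >= high:
            flagged.append((match["label"], low, high))
    return flagged


# A few lines copied from the output above.
sample = """\
in0:          +3.06 V  (min =  +2.99 V, max =  +2.28 V)  ALARM
in6:          +1.46 V  (min =  +0.78 V, max =  +2.54 V)
fan2:        3901 RPM  (min =  121 RPM)
temp2:        +79.0°C  (low  = +115.0°C, high = +123.0°C)  sensor = thermistor
"""

for label, low, high in suspicious_limits(sample):
    print(f"{label}: lower limit {low} is not below upper limit {high}")
```

Note that temp2 is not flagged here even though its low limit sits above the current reading: catching that would need a second check against the value itself.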

I usually don’t care much about those values, as I check on the boxes’ health at least once a day, and lately a bit more often because I’m monitoring a new production server for a customer. My main home router has always reported issues with its fans, even though the fans themselves were running properly. I never cared much because I knew it was a fluke.

But when today, after the summer holidays, another customer’s backup server started showing temperature warnings, I was much more worried; after all, these are very hot days and I’m almost passing out from the heat myself. The warning turned out to be another fluke: before the holidays I updated the server to kernel 2.6.39, which has a driver for the sensor in that box. Unfortunately the sensors output reported 0°C as the critical temperature, and Munin assumed the reading was bad.

While there are one or two ways to work around this on the Munin side, I found it more solid to resolve the issue in the sensors configuration itself. You can do that by creating a /etc/sensors.d/local file with a few directives for the sensors modules to handle:

chip "nct6775-isa-0680"
     set temp1_max 80
     set temp1_max_hyst 75

But just creating this file is not enough to make it work, as you will have found if you tried it: you have to tell the kernel about it, because sensors simply displays data it fetches straight from the kernel. This is done through sensors -s, or through the lm_sensors init script if you have INITSENSORS=yes in /etc/conf.d/lm_sensors.
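To check that the kernel really picked up the new limits, you can read them back from sysfs, which is where the hwmon drivers expose them. This is a minimal sketch, assuming the attribute files (temp1_max, fan1_min and so on) sit directly inside each /sys/class/hwmon/hwmonN directory; on some kernels they live in a device/ subdirectory instead, and read_limits is a name I made up for illustration.

```python
from pathlib import Path


def read_limits(hwmon_dir):
    """Return {attribute: value} for the temp*/in*/fan* limit files in one hwmon directory.

    The kernel stores temperatures in millidegrees Celsius and voltages in
    millivolts, so those are divided by 1000; fan speeds are already in RPM.
    """
    limits = {}
    for path in sorted(Path(hwmon_dir).iterdir()):
        name = path.name
        if not name.startswith(("temp", "in", "fan")):
            continue
        if not name.endswith(("_min", "_max", "_crit")):
            continue
        try:
            raw = int(path.read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable or non-numeric attribute, skip it
        limits[name] = raw if name.startswith("fan") else raw / 1000
    return limits


if __name__ == "__main__":
    # The hwmonN numbering differs from machine to machine.
    root = Path("/sys/class/hwmon")
    if root.is_dir():
        for hwmon in sorted(root.glob("hwmon*")):
            print(hwmon.name, read_limits(hwmon))
```

With the configuration above applied through sensors -s, the nct6775 entry would be expected to report temp1_max as 80.0.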

What was the problem with my router’s sensors then? Well, the minimum value for one of the fans was set to over twenty thousand rotations per minute. It was just a matter of resetting it to 1500 (for a fan that averages around 2000 RPM it should be okay to warn at that point), and now it works quite nicely without warning me.

I’m now considering spending some more time setting the limits to actually relevant values, so that I would be warned in case of a serious issue, but for now I think that just not getting a visual alert for a fluke is enough.

Finally, I wish to thank Steve Schnepp both for implementing native IPv6 support in the upcoming Munin 2.0 (which, together with the native SSH transport, should greatly simplify the current mess of aggregating data for the monitored hosts on my system), and for merging the three patches I sent, which will be part of the new release.

3 thoughts on “Munin and lm_sensors”

  1. I’m pretty sure I too have an AT5ION-1 (is it the nvidia one with the gigantic heatsink?) and I’ve been using atk0110-acpi to record temperatures and the like. I just invoked sensors and this is the output:

     Adapter: ACPI interface
     Vcore Voltage: +1.12 V (min = +0.85 V, max = +1.60 V)
     +3.3 Voltage: +3.32 V (min = +2.97 V, max = +3.63 V)
     +5 Voltage: +5.05 V (min = +4.50 V, max = +5.50 V)
     +12 Voltage: +12.05 V (min = +10.20 V, max = +13.80 V)
     CPU FAN Speed: 2500 RPM (min = 600 RPM)
     CHASSIS FAN Speed: 0 RPM (min = 600 RPM)
     CPU Temperature: +66.0°C (high = +60.0°C, crit = +95.0°C)
     GPU Temperature: +57.0°C (high = +60.0°C, crit = +95.0°C)

     coretemp-isa-0000
     Adapter: ISA adapter
     Core 0: +46.0°C (crit = +100.0°C)

     coretemp-isa-0001
     Adapter: ISA adapter
     Core 1: +46.0°C (crit = +100.0°C)

     I don’t know if these are correct temps, though; I haven’t tweaked anything that I can remember. Is this by any means correct? Or are any values strange? Kernel was 2.6.37.


  2. Ah, I didn’t even consider the atk driver; I always forget it exists. Indeed that would have worked without the correct lm_sensors driver for the chip (and it can coexist!):

     coretemp-isa-0000
     Adapter: ISA adapter
     Core 0: +58.0°C (high = +80.0°C, crit = +100.0°C)
     Core 1: +60.0°C (high = +80.0°C, crit = +100.0°C)

     it8721-isa-0290
     Adapter: ISA adapter
     in0: +3.06 V (min = +2.80 V, max = +2.09 V) ALARM
     in1: +3.06 V (min = +2.29 V, max = +2.35 V) ALARM
     in2: +1.12 V (min = +1.81 V, max = +0.62 V) ALARM
     in3: +2.93 V (min = +1.79 V, max = +2.53 V) ALARM
     in4: +2.71 V (min = +1.51 V, max = +0.80 V) ALARM
     in5: +2.84 V (min = +3.05 V, max = +1.93 V) ALARM
     in6: +1.46 V (min = +0.78 V, max = +2.54 V)
     3VSB: +4.08 V (min = +4.20 V, max = +0.94 V) ALARM
     Vbat: +3.46 V
     fan2: 3879 RPM (min = 121 RPM)
     temp1: +75.0°C (low = -45.0°C, high = +100.0°C) sensor = thermal diode
     temp2: +69.0°C (low = +115.0°C, high = +123.0°C) sensor = thermistor

     atk0110-acpi-0
     Adapter: ACPI interface
     Vcore Voltage: +1.12 V (min = +0.85 V, max = +1.60 V)
     +3.3 Voltage: +3.29 V (min = +2.97 V, max = +3.63 V)
     +5 Voltage: +5.09 V (min = +4.50 V, max = +5.50 V)
     +12 Voltage: +11.71 V (min = +10.20 V, max = +13.80 V)
     CPU FAN Speed: 3879 RPM (min = 600 RPM)
     CHASSIS FAN Speed: 0 RPM (min = 600 RPM)
     CPU Temperature: +75.0°C (high = +60.0°C, crit = +95.0°C)
     GPU Temperature: +69.0°C (high = +60.0°C, crit = +95.0°C)


  3. Thanks, that’s good to know. I was more worried about it having bad readings, as the temperatures of your box are a lot higher than mine. I was recompiling gcc and its companions (didn’t use it for a couple of months) and, in a room that was at 30+ Celsius, the CPU was reaching 69°C before I decided to turn on the a/c. Another thing I noticed just now is that the it8721 ‘low temps’ are either negative values or far too high. What is the main difference between the two drivers? As far as I can see, only the voltage data is treated differently.
