Some UPS notes (apcupsd et al)

If you didn’t notice, one of the packages I’ve been maintaining in Gentoo is apcupsd that is one of the daemons that can be used to control APC-produced UPS units. But for quite a while, my maintenance of the package was mostly limited to keeping it in a working state (with a wide range of different results, to be honest), since from the original (messy) ebuild I originally inherited, the current one is quite linear and, in my opinion, elegant.

But in the past two weeks or so, a few things happened that required me to look into the package more closely: a version bump, a customer having two UPSes connected to the same system, and the only remaining non-APC UPS in my home office declaring itself dead.

The version bump was the chance for me to finally fix the strict aliasing issue that is still present; this time, instead of simply disabling strict aliasing (quick, hacky way) I decided to look at the code, to make it actually strict aliasing compliant. This might not sound like much, but this kind of warnings is particularly nasty as you never know when it will cause an issue. Besides, it caused Portage to abort in stricter mode, that is what I use for packages I maintain myself.

Also, while my customer’s needs didn’t really influence my work on apcupsd itself, it caused me to look even more into munin’s apc_nis plugin as beforehand it was not configurable at all: it only ever used localhost:3551 to connect to APC NIS interface, which meant that if you wanted to change the port, or make it only listen on an external interface, you were out of luck. The patch to make this configurable is now part of Munin trunk, but I haven’t had time to ask Jeremy to add it to Gentoo as well (the few patches of mine to Munin are all merged upstream now, and Munin 2 will have those, and finally, native IPv6 transport, which means I probably won’t need to use ssh connections to fetch data over NAT, but just properly-configured firewalls).

There is another issue that comes up when having multiple UPS connected to the same box though: permanence of device names. While the daemon auto-discovers a single connected APC device, when you have multiple devices you need to explicitly tell it to access a given one. To do so, you could use the hiddev device paths, but the kernel does not make those persistent if you connect/disconnect the units. To solve this issue, the new ebuild for apcupsd that I committed today uses udev rules to provide /etc/apcups/usb-${SERIALNO} symlinks that you can use to provide stable reference to your apcupsd instances. I sent the rules upstream, hoping that they’ll be integrated in the next release.

A note here: while I’m a fan of autoconfiguration, I’m having trouble considering the idea of having apcupsd auto-started when an APC UPS is connected. The reason is not obvious though: while it would work fine if it was the only UPS and thus the only apcupsd instance to be present, if you had a second instance set up for a different UPS there would be no way to match the two together. This is at a minimum bothersome.

Speaking about init scripts, the powerfail init script currently only works in single-UPS configurations (whereas the main init script works fine in multiple UPS configurations), and even there it is a bit … broken. The powerfail flag can be written in a number of different places – the default and the Gentoo variants also point to different paths! – but this script does not take that into consideration at all. More to the point, the default, which uses /var/run might not be available at the shutdown init level since that would probably have been unmounted by that time. What I should do, I guess, is make it possible for the init script to fetch the configured value from the apcuspd configuration file, and move the default to use /run.

Next problem in my list is that apcaccess should not be among the superuser binaries, since it can be run from user just fine, but I’ll have to get that cleared with upstream first, it might break some scripts to move it in Gentoo only.

Finally, there is the problem that the sources of apcupsd are written with disregard for what many consider “library-only problems” – namely PIC – and has a very nasty copy-on-write scorecard. Unfortunately, some of the issues are so rooted into the design that I don’t feel up to fix the sources myself, but if somebody wanted a project to follow and optimise, that might be a good choice in this respect.

Sigh. I hope to find more time to fix the remaining issues with the scripts soon. For now if you have comments/notes that I might have missed, your feedback is welcome.

Munin no[dt]es

Back when I was looking at entropy I took Jeremy’s suggestion and started setting up Munin on my boxes to have an idea of their average day situation. This was actually immensely useful to note that the EntropyKey worked fine, but the spike in load and processes caused entropy depletion nonetheless. After a bit of time spent tweaking, trying, and working with munin, I now have a few considerations I’d like to make you a part of.

First of all, you might remember my custom ModSecurity ruleset (which seems not to pick the interest of very many peopole). One of the things that my ruleset filters is requests coming from user-agents only reporting the underlying library version (such as libCurl, http-access2, libwww-perl, …), as most of the time they seem to be bound to be custom scripts or bots. As it happens, even the latest sources for Munin do not report themselves as being part of anything “bigger”, with the final result that you can’t monitor Apache when my ModSecurity ruleset is present, d’oh!

Luckily, Christian helped me out and provided me with a patch (that is attached to bug #370131 which I’ll come back to later) that makes munin plugins show themselves as part of munin when sending the requests, which stops my ruleset from rejecting the request altogether.

In the bug you’ll also find a patch that changes the apc_nis plugin, used to request data to APC UPSes though apcupsd, to graph one more variable, the UPS internal temperature. This was probably ignored before because it is not present on most low-end series, such as the Back-UPS, but it is available in my SmartUPS, which last year did indeed shut down abruptly, most likely because of overheating.

Unfortunately it’s not always as easy to fix the trouble; indeed one of the most obnoxious, for me, issues is that Munin does not support using IPv6 to connect to nodes. Unfortunately the problem lies in the Perl Net-Server module, and there is no known solution, barring some custom distribution patches — which I’d rather not ask to implement in Gentoo. For now I solved this by using ssh with port forwarding over address 127.0.0.2, which is not really nice, but works decently for my needs.

Another interesting area relates to sensors handling: Munin comes with a plugin to interpret the output of the sensors program from lm_sensors, which is tremendously nice to make sure that a system is not overheating constantly. Unfortunately this doesn’t always work as good. Turns out that the output format expected by Munin has changed quite a bit in the latest versions, namely the min/max/hyst tuple of extra values vary their name depending on the used chip. Plus it doesn’t take into consideration the option that one of the sensors is altogether disabled.

This last problem is what hit me on Raven (my frontend system). The sensors of the board – Asus AT5IONT-I – were not supported on Linux up to version 2.6.38; I could only watch over the values reported by coretemp, the Intel CPU temperature sensor. With versions 2.6.39, driver it8721 finally supported the board and I could watch over all the temperature values. But of the three temperature values available from the sensor, only two are actually wired, respectively to a thermal diode and a thermistor; the third results disabled, but is still reported by the sensors output with a value of –128°C which Munin then graphs. The only way I found to disable that, was to create a local sensors.d configuration file to stop the value from appearing.

One interesting note relates to the network I/O graphing: Munin didn’t seem to take very nice the rollover at the reboot of my router: it reported a peak of petabits I/O at the time. The problem turns out to be easy to solve, but not as easy to guess. When reading the ifconfig output, the if_ plugin does not discard any datapoint, even if it is bigger than the speed of any given interface, unless it can read the interface’s real speed through mii-tool. Unfortunately it can only do that if the plugin is executed as root, which obviously is not the default.

Another funny plugin is hddtemp_smartctl — rather than using the hddtemp command itself, it uses smartctl which is part of smartmontools, which I most definitely keep around more often than the former. Unfortunately it also has limitations; the most obnoxious is that it doesn’t allow you to just list a bunch of device paths. Instead you have to first provide it with a set of device names, and then you can override their default device path; this is further complicated by the fact that each of the paths is forced to have /dev/ prefixed. Since Yamato has a number of devices, and not always they seem to get the same letters, I had to set it up this way:

[hddtemp_smartctl]
user root
group disk
env.drives sd1 sd2 sd3 sd4 sd5 sd6
env.dev_sd1 disk/by-id/ata-WDC_WD3202ABYS-01B7A0_WD-WCAT15838155
env.dev_sd2 disk/by-id/ata-WDC_WD3202ABYS-01B7A0_WD-WCAT16241483
env.dev_sd3 disk/by-id/ata-WDC_WD1002FAEX-00Z3A0_WD-WCATR4499656
env.dev_sd4 disk/by-id/ata-WDC_WD1002FAEX-00Z3A0_WD-WCATR4517732
env.dev_sd5 disk/by-id/ata-WDC_WD10EARS-00Z5B1_WD-WMAVU2720459
env.dev_sd6 disk/by-id/ata-WDC_WD10EARS-00Z5B1_WD-WMAVU2721970

For those curious, the temperatures vary between 39°C for sd6 and 60°C for sd3.

I have to say I’m very glad I started using Munin, as it helps understanding a few important things, among which is the fact that I need to separate my storage to a separate system, rather than delegating everything to Yamato (I’ll do so as soon as I have a real office and some extra cash), and that using a bridge to connect the virtualised guests to my home network is not a good idea (having Yamato with the network card in promiscuous mode means that all the packets are received by it, even when they are directed to the access point connecting me to the router downstairs).

Laptop policies on a workstation

As I bought the new UPS, I now have quite some time to work with it while I’m without power, as the two UPSes can keep the system online for about one hour without any kind of interruptions, included network access. This is quite nice, but there is one thing I was thinking of today…

When the workstations are running on UPS’s battery power, it’s comparable to a laptop running on battery, so I could be just using the same method laptops use: reduce the frequency of the CPU. At least Enterprise is able of CPU frequency scaling, so I’ve configured it to switch to power saving profile when the onbattery event is received by apcupsd.

It was actually quite simple, first of all apcupsd runs as root (although it should drop all privileges, but I’m not sure about it), and I have powersave installed and configured to handle this, so I just changed the /etc/apcupsd/onbattery and /etc/apcupsd/offbattery scripts to add a call to powersave binary, with -e Powersave on the onbattery event and -e Performance on the offbattery event; unfortunately I don’t think there is anything else beside slowing the CPU speed during UPS battery supply that can be done on a Workstation :(

Now I’m going to look to split the connection of the power cables in my home office so that the hardware that would just go in standby while I’m sleeping would be connected to a single multi-socket, which can be disconnected at once, to avoid leaving the hardware consuming power without any reason. I’d like to be able to do the same with the monitors (that otherwise would remain in standby as they can’t be shut down. If I wasn’t using the same UPS for both networking and monitors, I could probably use the UPS to shut them down and turn them back on at my command.

Sigh, I’m afraid there will be a looot of work to do, and I’m sealing envelopes, once again :(

A nice productive day

I might be sick, or just crazy, or both of them, but I still think I’m quite more productive when I have fever, or the days around that time. Yesterday I had fever, and I was knock out till late afternoon, but then I started feeling better, and I started producing.

First of all, rbot’s init script in my overlay has been updated: Subversion trunk will now create a rbot.pid file inside the bot’s directory without need for start-stop-daemon to create it itself, which means I can finally let rbot fork on its own rather than forcing it to background (which is not really a good thing to do, if it can be avoided). Thanks to tango and jbn for looking after my raw patch.

Then I decided to finish the work with apcupsd; again in my overlay you can find a new ebuild for 3.14.0 based on the one found on bugzilla, but with a new apccontrol file and a totally renewed init script. This script can be multiplied, which means you can have a /etc/init.d/apcupsd.foo link, which would then start apcupsd looking for /etc/apcupsd/foo.conf configuration file. Thanks to apcupsd authors for implementing this feature in 3.14!

I’ve also attached the two UPSes to Farragut instead of Enterprise, as the latter is not a server and might as well be offline when Farragut is still up (for instance this is the case most of the times I’m outside for the whole day, or if there is noone at home); apcupsd on FreeBSD works nicely, and doesn’t require any fiddling with configuration, neither kernel side (the ugen support is built by default) nor with permissions (as the default is to run as root, this might change in the future, but as it is it’s fine to me; I’ll be working on a better handling of permissions on device nodes for Gentoo/FreeBSD, but it’s not in my priority list at the moment). Also here, they work perfectly fine.

I was also able to fix a bug in xine-lib, with mp4/mov files playback that used version 1 rather than version 0 of the media header atom, such as files generated by FFmpeg. The bug was reported on sourceforge already but I wasn’t sure what it actually meant and where to find a sample file; when I generated the same condition by chance here, I decided to take a deeper look; unfortunately MultimediaWiki doesn’t provide much information about that, but I asked Mike to give me an account there so I can try to write something useful, maybe next time someone else needs a mdhd atom description they won’t have to look at the sources of FFmpeg to see how it’s read and generated.

Then tonight I wanted to resume my work on implementing audio conversions inside the audio output loop instead of doing it for every decoder; it’s an hard work as it probably will require rewriting a good deal of code, but it should be rewarding once it’s done. Right now there are a bunch more of flag values for capabilities, so for instance I can say if a drivers supports integer or float 32-bit samples, 64-bit samples, and if it can accept streams in a different endianness. This is important because there’s little point in doing the job of the output plugin, that might handle that transparently, for instance a big-endian stream might be decoded on a little-endian machine, then sent through PulseAudio to a big-endian machine where it will be reproduced: in this setup, xine’s endian reversal of the stream (from big to little endian) would have been superfluous, as PulseAudio would have accepted the big-endian samples, then sent them to the other machine that needed not to reverse them to reproduce them.

Anyway, right now the code is quite fragile, there’s no conversion being done, there are mostly only things that are totally broken out, there are asserts 1 == 0 used to mark the code that needs to be rewritten. But something works: I was able to remove a lot of code from the musepack decoder, as libmpcdec always produces 32-bit native-endian (or maybe little-endian, I’m not yet sure) floating point samples; previously the decoder converted all the samples back to 16-bit format, and then gave it to the audio output loop to handle… now instead it sends them directly to the output, and as PulseAudio supports 32-bit float samples, they are not converted and play back fine.

Tomorrow I’ll see to work a way to handle upsamping and downsampling of streams, the problem is that it’s not trivial to decide what to do: if a plugin supports 32-bit integer samples, but not 24-bit integer samples, it should probably upscale the 24-bit to 32-bit to avoid losing precision; if it doesn’t it might upscale it to 32-bit float, or maybe downscale it to 16-bit integers. The same applies to channel mode, if the driver doesn’t support stereo output, should it be updated to 4.0 or should it be downgraded to mono?

For sure this time I’m very happy of being working on branches: leaving the code broken for weeks, maybe months, is not something you want to do on the main development branch. And I mean it, because with the changes I’m doing, not only I’ll be changing the ABI of the library itself (well, actually not much, just a couple of structures), but more importantly I’ll be changing the audio output plugins API, as I need to feed them a sample format rather than a bits-per-sample size.

Anyway, this is not going to be something easy to complete, but it will be a noticeable improvement for Amarok users once done, especially because I want to make sure that the capabilities for “mixer” volume and “PCM” volume are cleared up, probably by deprecating one of them, so that Amarok can be changed not to use xine’s software amplification (which also sucks and I also need to rewrite in good part in this branch) if the output plugin actually supports a per-stream volume (like PulseAudio).

Sponsoring, bribing, and comfort words are welcome, as xine-lib’s audio_out code is giving me creeps.

Downtime and new UPS

Sorry for the downtime guys, the update to baselayout 2 on Farragut wasn’t as easy as expected, I have some talk to do with Roy about it ;)

Now, this is probably going to be the fourth night I’m not gonna sleep well, today I had a support request to take care of in the afternoon and I was supposed to go out with a few friends tonight, but I was so tired that I started having fever, so I had to get a raincheck for that.

The good news is that I received the new UPS, an APC SmartUPS 1000VA, that seems to be able to take enterprise and farragut up for 1 hour and a quarter, which is quite good (the other UPS would take the monitors and the network equipment up for also an hour, which is not bad, and should cover most of short-time non-planned outages.

Unfortunately, if you have followed me for a while, you know there’s no flawless hardware acquisition for me; this time the problem comes from the software needed to control the two UPSes; please note that there are two.

First of all, when I started using an UPS, I used apcupsd but then I moved to nut because it had a decent graphical control software (knutclient); what was the problem with this? Well, nut was confused by the two UPSes, that being both APC, shares the same identical vendor and product ID for the USB device. So it was not a nice thing.

I’ve then decided to come back to apcupsd, but the latest version is not in portage (again) so I bumped it locally with the patch on Bugzilla, and added a gnome useflag for the graphical control utility that is now present. This version is important to me, not only because of the utility, but also because with this version is possible to choose the configuration file at runtime, as it’s not hardcoded during build.

This is also important because in previous versions the only way to have more than one UPS being monitored on the same box with apcupsd required to have two different builds on different places. What I want to do now is to rewrite the init script so that it is multiplexed (like rbot, mt-daapd, openvpn and so on), that way I can simply start two apcupsd instances to have all I need (note that one of them won’t be shutting off my box at all, as it would just keep monitoring the UPS that is actually connected to the monitors and networking.

Once Baselayout 2 is more functional on Farragut, I’ll probably move apcupsd instances there, and network it down here, it would be safer on the long run as enterprise might not be up when farragut is.

I still haven’t been able to finish working on rbot’s changes I need, sigh. And don’t get me started on xine-lib, I’m taking a few days off because what I found myself working on now is a veeery bad thing I’m afraid…. (bit per sample transcoding during output).