Some UPS notes (apcupsd et al)

If you didn’t notice, one of the packages I’ve been maintaining in Gentoo is apcupsd, one of the daemons that can be used to control APC-produced UPS units. But for quite a while my maintenance of the package was mostly limited to keeping it in a working state (with a wide range of different results, to be honest), since, compared to the (messy) ebuild I originally inherited, the current one is quite linear and, in my opinion, elegant.

But in the past two weeks or so, a few things happened that required me to look into the package more closely: a version bump, a customer having two UPSes connected to the same system, and the only remaining non-APC UPS in my home office declaring itself dead.

The version bump was the chance for me to finally fix the strict-aliasing issue that was still present; this time, instead of simply disabling strict aliasing (the quick, hacky way) I decided to look at the code, to make it actually strict-aliasing compliant. This might not sound like much, but this kind of warning is particularly nasty, as you never know when it will cause an issue. Besides, it caused Portage to abort in stricter mode, which is what I use for packages I maintain myself.
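
To give an idea of what such a fix usually looks like — this is a generic sketch, not the actual apcupsd code — the classic violation is type-punning through a pointer cast, and the compliant rewrite goes through memcpy (or a union), which GCC compiles away entirely at -O2:

#include <stdint.h>
#include <string.h>

/* Undefined behaviour: dereferencing a float through an incompatible
 * pointer type violates the strict aliasing rules. */
uint32_t broken_bits(float f) {
    return *(uint32_t *)&f;
}

/* Compliant: copy the object representation instead of aliasing it. */
uint32_t fixed_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}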

Also, while my customer’s needs didn’t really influence my work on apcupsd itself, they caused me to look even more into Munin’s apc_nis plugin, as beforehand it was not configurable at all: it only ever used localhost:3551 to connect to the APC NIS interface, which meant that if you wanted to change the port, or make it listen only on an external interface, you were out of luck. The patch to make this configurable is now part of Munin trunk, but I haven’t had time to ask Jeremy to add it to Gentoo as well (the few patches of mine to Munin are all merged upstream now, and Munin 2 will have those, and finally, native IPv6 transport, which means I probably won’t need to use ssh connections to fetch data over NAT, but just properly-configured firewalls).
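
For reference, Munin plugins are normally configured through environment variables in plugin-conf.d, so the configuration enabled by the patch should look roughly like this — the variable names here are my own guess following Munin’s usual env.* convention, not a quote from the patch:

# /etc/munin/plugin-conf.d/apcupsd — hypothetical variable names
[apc_nis]
env.host ups.example.com
env.port 3551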

There is another issue that comes up when having multiple UPSes connected to the same box, though: persistence of device names. While the daemon auto-discovers a single connected APC device, when you have multiple devices you need to explicitly tell it which one to access. To do so, you could use the hiddev device paths, but the kernel does not keep those persistent if you connect/disconnect the units. To solve this issue, the new ebuild for apcupsd that I committed today uses udev rules to provide /etc/apcups/usb-${SERIALNO} symlinks that you can use as stable references for your apcupsd instances. I sent the rules upstream, hoping that they’ll be integrated in the next release.
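
The gist of the rules is something like the following sketch — the actual rules I committed may differ in the matching details, and 051d is, as far as I know, APC’s USB vendor ID:

# Hypothetical sketch: give each APC unit a stable name keyed on its
# serial number, surviving disconnects and reconnects.
KERNEL=="hiddev*", SUBSYSTEMS=="usb", ATTRS{idVendor}=="051d", \
  RUN+="/bin/ln -sf /dev/usb/%k /etc/apcups/usb-$attr{serial}"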

A note here: while I’m a fan of autoconfiguration, I have trouble with the idea of having apcupsd auto-started when an APC UPS is connected. The reason is not obvious though: while it would work fine if it were the only UPS, and thus the only apcupsd instance present, if you had a second instance set up for a different UPS there would be no way to match the two together. This is, at a minimum, bothersome.

Speaking about init scripts, the powerfail init script currently only works in single-UPS configurations (whereas the main init script works fine in multiple-UPS configurations), and even there it is a bit … broken. The powerfail flag can be written in a number of different places – the default and the Gentoo variants also point to different paths! – but this script does not take that into consideration at all. More to the point, the default, which uses /var/run, might not be available at the shutdown init level, since that would probably have been unmounted by that time. What I should do, I guess, is make it possible for the init script to fetch the configured value from the apcupsd configuration file, and move the default to use /run.
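
Something along these lines is what I have in mind — just a sketch, assuming the flag directory is set through the usual PWRFAILDIR directive in apcupsd.conf:

# Read the configured flag directory instead of hardcoding a path,
# falling back to /run when the directive is commented out.
PWRFAILDIR=$(awk '$1 == "PWRFAILDIR" { print $2 }' /etc/apcupsd/apcupsd.conf)
: "${PWRFAILDIR:=/run}"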

Next problem on my list is that apcaccess should not be among the superuser binaries, since it can be run by a regular user just fine, but I’ll have to get that cleared with upstream first; moving it in Gentoo only might break some scripts.

Finally, there is the problem that the sources of apcupsd are written with disregard for what many consider “library-only problems” – namely PIC – and have a very nasty copy-on-write scorecard. Unfortunately, some of the issues are so rooted in the design that I don’t feel up to fixing the sources myself, but if somebody wanted a project to follow and optimise, this might be a good choice in that respect.

Sigh. I hope to find more time to fix the remaining issues with the scripts soon. For now, if you have comments or notes on things I might have missed, your feedback is welcome.

Compilers’ rant

Be warned that this post is written in the form of a rant, because I’ve spent the past twelve hours fighting with multiple compilers, trying to make sense of them while trying to get the best out of my unpaper fork thanks to their different analyses.

Let’s start with a few more notes about the Pathscale compiler I already slightly ranted about for my work on Ruby-Elf. I know I didn’t do the right thing when I posted that stuff, as I should have reported the issues upstream directly, but I didn’t have much time, I was already swamped with other tasks, and going through a very bad personal moment, so I quickly wrote up my feelings without doing proper reports. I have to thank the Pathscale people for accepting the critiques anyway, as Måns reported to me that a couple of the issues I noted, in particular the use of --as-needed and the __PIC__ definition, were taken care of (sorta; see in a moment).

First problem with the Pathscale compiler: by mistake I have been using the C++ compiler to compile C code; rather than screaming at me, it passed through properly, with one little difference: a static constant gets mis-emitted, and this is not a minor issue at all, even though I am using the wrong compiler! Instead of having the right content, the constant is emitted as an empty, zeroed-out array of characters of the right size. I only noticed because Ruby-Elf’s cowstats reported that what should have been a constant ended up in the .bss section. This is probably the most worrisome bug I have seen with Pathscale yet!

Of course its impact is theoretically limited by the fact that I was using the wrong compiler, but since the code should be written in a way to be both valid C and C++, I’m afraid the same bug might exist for some properly-written C++ code… I hope it gets fixed soon.

The killer feature of Pathscale’s compiler is supposedly optimisation, though, and there it looks like it is doing quite a nice job; indeed, I can see from the emitted assembler that it is finding more semantics in the code than GCC seems to, even though it requires -O3 -march=barcelona to make something useful out of it — and in that case you give up debugging information, as the debug sections may reference symbols that were dropped, and the linker will be unable to produce a final executable. This is hit and miss of course, as it depends on whether the optimiser drops those symbols, but it makes it difficult to use -ggdb at all in these cases.

Speaking about optimisations, as I said in my other post, GCC’s missed optimisation is still missed by Pathscale even with full optimisation (-O3) turned on, and with the latest sources. And the wrong placement of static constants that I ranted about in that post is also still not fixed.

Finally, for what concerns the __PIC__ definition that Måns reported as fixed, well, it isn’t really as fixed as one would expect. Yes, using -fPIC now implies defining __PIC__ and __pic__ as GCC does, but there are two more issues:

* While this does not apply to x86 and amd64 (only to m68k, PowerPC and SPARC), GCC supports two modes for emitting position-independent code: one that is limited by the architecture’s maximum global offset table size, and one that overrides such maximum size (I never investigated how it does that, probably through some indirect tables). The two options are enabled through -fpic (or -fpie) and -fPIC (-fPIE) and define the macros as 1 and 2, respectively; Path64 only ever defines them to 1.
* With GCC, using -fPIE – which is used to emit Position Independent Executables – or the alternative -fpie, of course implies the use of -fPIC, which in turn means that the two macros noted above are defined; at the same time, two more are defined, __pie__ and __PIE__, with the same values as described in the previous point. Path64 defines none of these four macros when building PIE. (A quick way to check all these definitions is sketched right after this list.)
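
This small sketch prints whatever position-independence macros the compiler defined, GCC-style; the same information can be had without compiling anything via cc $CFLAGS -E -dM - </dev/null | grep -i pic:

#include <stdio.h>

int main(void) {
#ifdef __PIC__
    printf("__PIC__ = %d\n", __PIC__); /* 1 for -fpic, 2 for -fPIC on GCC */
#endif
#ifdef __PIE__
    printf("__PIE__ = %d\n", __PIE__); /* 1 for -fpie, 2 for -fPIE on GCC */
#endif
#if !defined(__PIC__) && !defined(__PIE__)
    puts("no position-independence macros defined");
#endif
    return 0;
}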

But enough ranting about Pathscale, before they feel I’m singling them out (which I’m not). Let’s rant a bit about Clang as well, the only compiler up to now that properly dropped write-only unit-static variables. I had very high expectations for improving unpaper through its suggestions, but… it turns out it cannot really create any executable here; at least, that’s what autoconf tells me:

configure:2534: clang -O2 -ggdb -Wall -Wextra -pipe -v   conftest.c  >&5
clang version 2.9 (tags/RELEASE_29/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
 "/usr/bin/clang" -cc1 -triple x86_64-pc-linux-gnu -emit-obj -disable-free -disable-llvm-verifier -main-file-name conftest.c -mrelocation-model static -mdisable-fp-elim -masm-verbose -mconstructor-aliases -munwind-tables -target-cpu x86-64 -target-linker-version 2.21.53.0.2.20110804 -momit-leaf-frame-pointer -v -g -resource-dir /usr/bin/../lib/clang/2.9 -O2 -Wall -Wextra -ferror-limit 19 -fmessage-length 0 -fgnu-runtime -fdiagnostics-show-option -o /tmp/cc-N4cHx6.o -x c conftest.c
clang -cc1 version 2.9 based upon llvm 2.9 hosted on x86_64-pc-linux-gnu
#include "..." search starts here:
#include <...> search starts here:
 /usr/local/include
 /usr/bin/../lib/clang/2.9/include
 /usr/include
 /usr/lib/gcc/x86_64-pc-linux-gnu/4.6.1/include
End of search list.
 "/usr/bin/ld" --eh-frame-hdr -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 -o a.out /usr/lib/../lib64/crt1.o /usr/lib/../lib64/crti.o crtbegin.o -L -L/../../../../lib64 -L/lib/../lib64 -L/usr/lib/../lib64 -L/../../.. /tmp/cc-N4cHx6.o -lgcc --as-needed -lgcc_s --no-as-needed -lc -lgcc --as-needed -lgcc_s --no-as-needed crtend.o /usr/lib/../lib64/crtn.o
/usr/bin/ld: cannot find crtbegin.o: No such file or directory
/usr/bin/ld: cannot find -lgcc
/usr/bin/ld: cannot find -lgcc_s
clang: error: linker command failed with exit code 1 (use -v to see invocation)
configure:2538: $? = 1
configure:2576: result: no

What’s going on? Well, Clang doesn’t provide its own crtbegin.o file for the C runtime prologue (while Path64 does), so it relies on the one provided by GCC, which has to be on the system somewhere. Unfortunately, to identify where this file is… they resort to trial and error.

% strace -e stat clang test.c -o test |& grep crtbegin.o
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.5.2/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.5.2/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.5.1/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.5.1/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.5/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.5/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.4.5/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.4.5/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.4.4/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.4.4/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.4.3/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.4.3/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.4/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.4/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.3.4/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.3.4/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.3.3/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.3.3/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.3.2/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.3.2/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.3/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.3/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.2.4/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.2.4/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.2.3/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.2.3/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.2.2/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.2.2/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.2.1/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.2.1/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.2/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.2/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/crtbegin.o", 0x7fffc937f170)     = -1 ENOENT (No such file or directory)
stat("/../../../../lib64/crtbegin.o", 0x7fffc937f170) = -1 ENOENT (No such file or directory)
stat("/lib/../lib64/crtbegin.o", 0x7fffc937f170) = -1 ENOENT (No such file or directory)
stat("/usr/lib/../lib64/crtbegin.o", 0x7fffc937f170) = -1 ENOENT (No such file or directory)
stat("/../../../crtbegin.o", 0x7fffc937f170) = -1 ENOENT (No such file or directory)

Yes, as you can see, it has a hardcoded list of GCC versions that it looks for, from newest to oldest, until it falls back to some generic paths (which don’t make that much sense to me, to be honest, but never mind). This means that on my system, which only has GCC 4.6.1 installed, you can’t use Clang. This was reported last week, and while a patch is available, a real solution is still not there: we shouldn’t have to patch and bump Clang each time a new micro version of GCC is released that upstream didn’t already list!

Sigh. While GCC sure has its shortcomings, this is not really looking promising either.

Shared libraries worth their while

This is, strictly speaking, a non-Gentoo-related post; on the other hand, I’m going to introduce here a few concepts that I’ll use in a future post to explain one Gentoo-specific warning, so I’ll consider this a prerequisite. Sorry if you feel like Planet Gentoo should never cover technical non-Gentoo work, but then again, you’re free to ignore me.

I have, in the past, written about the need to handle shared code in packages that install multiple binaries (real binaries, not scripts!) to perform various tasks, binaries which end up sharing most of their code. Doing the naïve thing, compiling the source code into all of them, or the slightly less naïve thing, building a static library and linking it into all the binaries, tends to increase the size of the commands on disk, and the memory required to fully load them. In my previous post I noted a particularly nasty problem with the smbpasswd binary, which was almost twice the size because of unused code injected by the static convenience library (and probably even more, given that I never went as far as hiding the symbols and cleaning them up).

In another post, I also proposed the use of multicall binaries to handle these situations; the idea behind multicall binaries is that you end up with a single program, with multiple “applets”: all the code is merged into a single ELF binary object, and at runtime the correct path is taken to call the right applet, depending on the name used to invoke the binary. It’s not extremely easy, but not impossible either, to get right, so I still suggest that as the main alternative for handling shared code, when the shared code is bigger in size than the single applet’s code.
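
A minimal sketch of that dispatch logic, with made-up applet names (real implementations such as busybox add lookup tables and fallbacks, but the principle is this):

#include <libgen.h>
#include <stdio.h>
#include <string.h>

static int applet_foo(int argc, char *argv[]) { (void)argc; (void)argv; puts("foo"); return 0; }
static int applet_bar(int argc, char *argv[]) { (void)argc; (void)argv; puts("bar"); return 0; }

int main(int argc, char *argv[]) {
    /* The same ELF file is installed once, with symlinks foo and bar
     * pointing at it; argv[0] tells us which applet to run. */
    const char *name = basename(argv[0]);
    if (strcmp(name, "foo") == 0)
        return applet_foo(argc, argv);
    if (strcmp(name, "bar") == 0)
        return applet_bar(argc, argv);
    fprintf(stderr, "unknown applet: %s\n", name);
    return 1;
}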

This does not solve the Samba situation, though: the code of the single utilities is still big enough that merging them into a single super-binary would not be very manageable, so a different solution has to be devised. In this case you end up having to choose between static linking (the naïve approach) or using a private shared object. An easy way out here would be to feel sophisticated and always go with the shared object approach; it definitely might not be the best option though.

Let me be clear here: shared objects are not a panacea for shared-code problems. As you might have heard already, using shared objects is generally a compromise: you ease the problems related to bugs and security vulnerabilities, since you don’t have to rebuild all the software using that code — and most of the time you also share read-only memory, reducing the memory consumption of a system — at the expense of load time (the loader has to do much more work), sometimes execution speed (PIC takes its toll), and sometimes memory usage, as counter-intuitive as that might sound given that I just said they reduce memory consumption.

While the load-time and execution-speed tolls are pretty much immediate to understand, and you can find a lot of documentation about them on the net, it’s less obvious how shared objects can both share and waste memory. I wrote extensively about the copy-on-write problem, so if you follow my blog regularly you might have guessed the problem already at this point, but that does not fill in all the gaps yet, so let me try to explain how this compromise works.

When we use ELF objects, parts of the binary file itself are shared in memory across different processes (homogeneous or heterogeneous). This means that only those parts that are not modified from the on-disk ELF file can be shared. This usually includes the executable code – text – for standard executables (and most code compiled with PIC support for shared objects, which is what we’re going to assume), and part (most) of the read-only data. In all cases, what breaks the sharing for us is copy-on-write, as that creates private copies of the pages for the single process, which is why writeable data is nothing we care about when choosing the code-sharing strategy (it’ll mostly be the same whether you link it statically or via shared objects — there are corner cases, but I won’t dig into them right now).

What was that above about homogeneous and heterogeneous processes? Well, it’s a mistake to think that the only memory shared in the system is due to shared objects: read-only text and data of an ELF executable file are shared among processes spawned from the same file (what I called, and will call, homogeneous processes). What shared objects accomplish is sharing memory between processes spawned by different executables that load the same shared objects (heterogeneous processes). The KSM implementation (no, it’s not KMS!) in current versions of the Linux kernel allows for something similar, but it’s a story so long that I won’t bother counting it in.

Again, a first approach to shared objects might make you think that moving whatever amount of memory from being shared between homogeneous processes to being shared between heterogeneous processes is a win-win situation. Unfortunately, you have to cope with data relocations (a topic I wrote about extensively): a constant pointer is read-only when the code is always loaded at a given address (as happens with most standard ELF executables), but it is not when the code can be loaded at an arbitrary address (as happens with shared objects): in the latter case it ends up in the relocated data section, which follows the same destiny as the writeable data section: it’s always private to the single process!

*Note about relocated data: in truth you could ensure that the data relocation is the same among different processes, by using either prelinking (which is not perfect especially with modern software, which is more and more plugin-based), or methods like KDE’s kdeinit preloading. In reality, this is not really something you could, or should, rely upon because it also goes against the strengthening of security applied by Address Space Layout Randomization.*

So when you move shared code from static linking to shared objects, you have to weigh two factors: how much code will be left untouched by the process, and how much will be relocated? The size utility from either elfutils or binutils will not help you here, as it does not tell you how big the relocated data section is. My Ruby-Elf instead has an rbelf-size script that gives you the size of .data.rel.ro (another point here: you only care about the increase in size of .data.rel.ro, as that’s what is added as private; .data.rel would be part of the writeable data anyway). You can see it in action here:

flame@yamato ruby-elf % ruby -I lib tools/rbelf-size.rb /lib/libc.so.6
     exec      data    rodata     relro       bss     total filename
   960241      4507    359020     12992     19104   1355864 /lib/libc.so.6

As you can see from this output, the C library has some 950K of executable code, 350K of read-only data (both will be shared among heterogeneous processes) and just 13K (tops) of additional relocated memory, compared to static linking. (Note: the rodata value does not include only .rodata but all the read-only non-executable sections; the sum of exec and rodata roughly corresponds to what size calls text.)

So how does knowing the amount of relocated data help in assessing how to deal with shared code? Well, if you build your shared code as a shared object and analyse it with this method (hint: I just implemented rbelf-size -r to reduce the columns to the three types of memory we have in front of us), you’ll have a rough idea of how much gain and how much waste you’ll have for what concerns memory: the higher the shared-to-relocated ratio, the better results you’ll have. An infinite ratio (no relocated data at all) is perfection.

Of course the next question is what you do if you have a low ratio. Well, there isn’t really one correct answer here: you might decide to bite the bullet and go into the code to improve the ratio; cowstats from the Ruby-Elf suite helps you do just that. It can actually help you reduce your private sections as well, as many times there are mistakes in there due to missing const declarations. If you have already done your best to reduce the relocations, then your only chance left is to avoid using a library altogether; if you’re not going to improve your memory usage by using a library, and it’s something internal only, then you really should look into using either static linking or, even better, multicall binaries.
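
As a quick illustration of the kind of missing-const mistake cowstats points at (a generic sketch, not taken from any specific package):

/* Without const, the table of pointers lands in .data: a whole
 * copy-on-write page per process just for three pointers. */
char *commands_wrong[] = { "start", "stop", "status" };

/* Fully const, the table can go to .rodata for executables, or to
 * .data.rel.ro for shared objects, and be shared between processes. */
const char *const commands_right[] = { "start", "stop", "status" };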

Important Notes of Caution

While I’m trying to go further on the topic of shared objects than most documentation I have read myself, I have to point out that I’m still generalising a lot! While the general concepts are as I put them down here, there are some specific situations that change the picture, making it much more complex: text relocations, position independent executables, and PIC overhead are just some of the problems that might arise while trying to apply these general ideas to specific situations.

Still trying not to dig too deep into the topic right away, I’d like to spend a few words on the PIE problem, which I have already described and noted in this blog: when you use Position Independent Executables (which is usually done to make good use of the ASLR technique), you can discard the whole check on relocated code: almost always you’ll have good results if you use shared objects (minus complications added by overlinking, of course). You would still have the best results by using multicall binaries if the commands have very little code.

Also, please remember that using shared objects slows down the loading process, which means that if you have a number of fire-and-forget commands, which is not too unusual in UNIX-like environments, you will probably have better results with multicall binaries, or static linking, than with shared objects. The shared memory is also something you’ll probably get to ignore in that case, as it’s only worth its while if you normally keep the processes running for a relatively long time (and thus loaded into memory).

Finally, all I said refers to internal libraries used for sharing code among commands of the same package: while most of the same notes about performance up- and down-sides hold true for all kinds of shared objects, you have to factor in the security and stability problems when you deal with third-party (or third-party-available) libraries — even if you develop them yourself and ship them with your package: they’ll still be used by many other projects, so you’ll have to handle them with much more care, and they should really be shared.

Plugins aren’t always a good choice

I’ve been saying this for quite a while; probably the most on-topic post was written a few months ago, but there are some indications about it in posts about xine and others as well.

I used to be an enthusiast about plugin interfaces; with time, though, I started having more and more doubts about their actual usefulness — it’s a trait I very much like in myself, being fine with reconsidering my own positions over time and deciding that I was wrong; it happened before with other things, like KDE (and C++ in general).

It’s not like I’m totally against the use of plugins altogether. I only think that they are expensive in more ways than one, and that their usefulness is often overstated, or tied to other kinds of artificial limitations. For instance, dividing a software’s features over multiple plugins usually makes it easier for binary distributions to package it: they only have to ship a package with the main body of the software, and many for the plugins (one per plugin might actually be too much, so sometimes they are grouped). This works out pretty well for both the distribution and, usually, the user: the plugins that are not installed will not bring in extra dependencies, they won’t take time to load and they won’t use memory for either code or data. It basically gives binary distributions a flexibility comparable to Gentoo’s USE flags (and similar options in almost any other source-based distribution).

But as I said, this comes with costs that might or might not be worth it in general. For instance, Luca wanted to implement plugins for feng similarly to what Apache and lighttpd have. I can understand his point: let’s not load code for the stuff we don’t have to deal with, which is more or less the same reason why Apache and lighttpd have modules; in the case of feng, if you don’t care about the access log, why should you be loading the access log support at all? I can give you a couple of reasons:

  • because the complexity of managing a plugin to deal with the access log (or any other similar task) is higher than just having a piece of static code that handles that;
  • because the overhead of having a plugin loaded just to do that is higher than that of having the static code built in but not enabled in the configuration.

The first problem is a result of the way a plugin interface is built: the main body of the software cannot know about its plugins in too-specific ways. If the interface is a very generic plugin interface, you add some “hook locations” and then it’s the plugin’s task to find out how to do its magic, not the software’s. There are some exceptions to this rule: if you have a plugin interface for handling protocols, like the KIO interface (and I think gvfs has the same), you get the protocol from the URL and call the correct plugin, but even then you’re leaving it to the plugin to do its magic. You can provide a way for the plugin to tell the main body what it needs and what it can do (like which functions it implements), but even that requires the plugins to be quite autonomous. And that also means being able to take care of allocating and freeing resources as needed.
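
To make the shape of the problem concrete, here is a minimal sketch of what such a generic interface tends to look like — the names and conventions are hypothetical, not feng’s actual API; link with -ldl:

#include <dlfcn.h>
#include <stddef.h>

struct plugin_ops {
    const char *name;
    int  (*init)(void);    /* the plugin allocates its own resources */
    void (*cleanup)(void); /* ... and has to free them itself */
};

const struct plugin_ops *load_plugin(const char *path) {
    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (!handle)
        return NULL;
    /* By convention, the plugin exports one well-known symbol; everything
     * else about its behaviour is opaque to the main body. */
    return (const struct plugin_ops *)dlsym(handle, "plugin_ops");
}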

The second problem is tied not only to the cost of invoking the dynamic linker at runtime to load the plugin and its eventual dependencies (which is a non-trivial amount of work, one has to say), but also to the need for code that deals with finding the modules to load, loading them, initialising them, and keeping a list of modules to call at any given interface point — plus two more points: the PIC problem and the problem of less-than-page-sized segments. This last problem is often ignored, but it’s my main reason to dislike plugins when they are not warranted for other reasons. Given a page size of 4KiB (which is the norm on Linux as far as I know), if the code is smaller than that size, it’ll still require a full page (it won’t pack with the rest of the software’s code areas); but at least code is disk-backed (if it’s PIC, of course). It’s worse for variable data, or variable relocated data, since those are not disk-backed, and it’s not rare to be using a whole page for something like 100 bytes of actual variables.

In the case of the access log module that Luca wrote for feng, the statistics are as follows:

flame@yamato feng % size modules/.libs/mod_accesslog.so
   text    data     bss     dec     hex filename
   4792     704      16    5512    1588 modules/.libs/mod_accesslog.so

Which results in one page each (8KiB in total) for the bss and data segments, neither disk-backed, and two disk-backed pages for the executable code (text): 16KiB of addressable memory for a mapping that does not reach 6KiB. That’s over 10KiB of overhead, much more than 50%. And that’s the memory overhead alone. The whole overhead, as you might guess at this point, is usually within 12KiB (since you have three segments, and each can have at most one byte less than the page size as overhead — it’s actually more complex than this, but let’s assume it is true).

It really doesn’t sound like a huge overhead by itself, but you always have to judge it relative to the size of the plugin itself. In the case of feng’s access log you have a very young plugin that lacks a lot of functionality, so one might say that with time it’ll be worth it… so I’d like to show you the size statistics for the Apache modules on the very server my blog is hosted on. Before doing so, though, I have to point out one huge difference: feng is built with most optimisations turned off, while Apache is built optimised for size; they are both AMD64 though, so the comparison is quite easy.

flame@vanguard ~ $ size /usr/lib64/apache2/modules/*.so | sort -n -k 4
   text    data     bss     dec     hex filename
   2529     792      16    3337     d09 /usr/lib64/apache2/modules/mod_authn_default.so
   2960     808      16    3784     ec8 /usr/lib64/apache2/modules/mod_authz_user.so
   3499     856      16    4371    1113 /usr/lib64/apache2/modules/mod_authn_file.so
   3617     912      16    4545    11c1 /usr/lib64/apache2/modules/mod_env.so
   3773     808      24    4605    11fd /usr/lib64/apache2/modules/mod_logio.so
   4035     888      16    4939    134b /usr/lib64/apache2/modules/mod_dir.so
   4161     752      80    4993    1381 /usr/lib64/apache2/modules/mod_unique_id.so
   4136     888      16    5040    13b0 /usr/lib64/apache2/modules/mod_actions.so
   5129     952      24    6105    17d9 /usr/lib64/apache2/modules/mod_authz_host.so
   6589    1056      16    7661    1ded /usr/lib64/apache2/modules/mod_file_cache.so
   6826    1024      16    7866    1eba /usr/lib64/apache2/modules/mod_expires.so
   7367    1040      16    8423    20e7 /usr/lib64/apache2/modules/mod_setenvif.so
   7519    1064      16    8599    2197 /usr/lib64/apache2/modules/mod_speling.so
   8583    1240      16    9839    266f /usr/lib64/apache2/modules/mod_alias.so
  11006    1168      16   12190    2f9e /usr/lib64/apache2/modules/mod_filter.so
  12269    1184      32   13485    34ad /usr/lib64/apache2/modules/mod_headers.so
  12521    1672      24   14217    3789 /usr/lib64/apache2/modules/mod_mime.so
  15935    1312      16   17263    436f /usr/lib64/apache2/modules/mod_deflate.so
  18150    1392     224   19766    4d36 /usr/lib64/apache2/modules/mod_log_config.so
  18358    2040      16   20414    4fbe /usr/lib64/apache2/modules/mod_mime_magic.so
  18996    1544      48   20588    506c /usr/lib64/apache2/modules/mod_cgi.so
  20406    1592      32   22030    560e /usr/lib64/apache2/modules/mod_mem_cache.so
  22593    1504     152   24249    5eb9 /usr/lib64/apache2/modules/mod_auth_digest.so
  26494    1376      16   27886    6cee /usr/lib64/apache2/modules/mod_negotiation.so
  27576    1800      64   29440    7300 /usr/lib64/apache2/modules/mod_cache.so
  54299    2096      80   56475    dc9b /usr/lib64/apache2/modules/mod_rewrite.so
 268867   13152      80  282099   44df3 /usr/lib64/apache2/modules/mod_security2.so
 288868   11520     280  300668   4967c /usr/lib64/apache2/modules/mod_passenger.so

The list is ordered by the size of the whole plugin (summed up, not counting padding); the last three positions are definitely unsurprising, although the sheer size of the two that are not part of Apache itself does surprise me (and I start to wonder whether they statically link something I missed). The fact that the rewrite module is likely the most complex plugin in Apache’s own distribution was never lost on me.

As you can see, almost all plugins have a vast overhead, especially for what concerns the bss segment (all of them have at least 16 bytes used, and that warrants a whole page for each: 4080 bytes wasted per module); the data segment is also interesting: only the two external ones have more than a page’s worth of variables (which also looks suspicious to me). When all the plugins are loaded (as they most likely are right now on my server), there are at least 100KiB of overhead just for the sheer fact that these are plugins and thus have their own address space. It might not sound like a lot of overhead, since Apache requests so much memory already, especially with Passenger running, but it definitely doesn’t sound like a good thing for embedded systems.

Now, I have no doubt that a lot of people like the fact that Apache has all of those as plugins, as they can then use the same Apache build across different configurations without risking having more code and data in memory than is actually needed, but is that right? While it’s obvious that it would be impossible to drop the plugin interface from Apache (since it’s used by third-party developers, more on that later), I would be glad if it were possible to build in the modules that come with Apache itself (given I can already choose which ones to build or not in Gentoo). Of course I too am using Apache with two configurations — for instance the other one does not use the authentication system at all, and this one is not using CGI — but is the overhead caused by the rest of the modules worth the hassle, given that Apache already has a way to not initialise unused built-ins?

I wrote “third-party developers” above, but I have to admit it wasn’t really a proper definition, since it’s not just third parties: it might very well be the original developers who want to use plugins to develop separate projects for some (complex) features, and have different release handling altogether. For uses like that, the cost of plugins is often justifiable; and I am definitely not against having a plugin interface in feng. My main beef is when plugins are created for functions that are part of the basic featureset of a software.

Another, unfortunately not uncommon, problem with plugins is that the interface might be skewed by bad design, as was (and is) the case for xine: when trying to open a file, it has to pass through all the plugins, so it loads all of them into memory, together with the libraries they depend on, to ask each of them to test the current file; since plugins cannot really be properly unloaded (and it’s not just a xine limitation), the memory will still be used, the libraries will still be mapped into memory (and relocated, causing copy-on-write, and thus more memory), and at least half the point of using plugins has gone away (the ability to only load the code that is actually going to be used). Of course you’re left with the chance that an ABI break does not kill the whole program, just the plugin, but that’s a very small advantage, given the cost involved in plugin handling. And the way xine was designed, it was definitely impossible to have third-party plugins developed properly.

And to finish off: I said before that plugins cannot be cleanly unloaded. The problem is not only that it’s difficult to have proper cleanup functions for the plugins themselves (since often the allocated resources are stored within state variables), but also that some libraries (used as dependencies) have no cleanup at all, and rely (erroneously) on the fact that they won’t be unloaded. And even when they know they could be unloaded, the PulseAudio libraries, for instance, have to remain loaded because there is no proper way to clean up Thread-Local Storage variables (and a re-load would be quite a problem). Which takes away another point of using plugins.

I leave the rest to you.

The PIE is not exactly a lie…

Update (2017-09-10): The bottom line of this article has changed in the eight years since it was posted, quite unsurprisingly. Nowadays, the vanilla kernel has a decent ASLR, and so everyone actually gains from building everything as PIE. Indeed, that’s exactly what Arch Linux and probably most other binary distributions do. The rest of the technical description of why this is important, and how, is still perfectly valid.

One very interesting misconception related to Gentoo, and especially the hardened sub-profile, concerns the PIE (Position-Independent Executable) support. This is probably due to the fact that up to now the hardened profile has always contained PIE support, and since it relates directly to PIC (Position-Independent Code), and PIC as well is tied back to hardened support, people tend to confuse which technique is used for which scope.

Let’s start by remembering that PIC is a compilation option that produces so-called relocatable code; that is, code that is valid no matter what base address it is loaded at. This is a particularly important feature for shared objects: to be loadable by any executable while still sharing the code pages in memory, the code needs to be relocatable; if it’s not, a text relocation has to happen.

Relocating the “text” means changing the executable code segment so that the absolute addresses (of both functions and data — variables and constants) are correct for the base address the segment was loaded at. Doing this causes a copy-on-write for the executable area, which, among other things, wastes memory (each running process will have its own private copy of the executable memory area, as well as of the variable data memory area). This is the reason why shared objects in almost any modern distribution are built relocatable: faster load time and reduced memory consumption, at the cost of sacrificing a register.

An important note here: sacrificing a register, which is something needed for PIC to keep the base address of the loaded segment, is a minuscule loss for most architectures, with the notable exception of x86, where there are very few general registers to use. This means that while PIC code is slightly (but not notably) slower for any other architecture, it is a particularly heavy hit on x86, especially for register-hungry code like multimedia libraries. For this reason, shared objects on x86 might still be built without PIC enabled, at the cost of load time and memory, while for most other architectures, the linker will refuse to produce a shared object if the object files are not built with PIC.

Up to now I said nothing about hardened at all, so let me introduce the first relation between hardened and PIC: it’s called PaX in Hardened Linux, but the same concept is called W^X (Write xor eXecute) in OpenBSD – which is probably a very descriptive name for a programmer – NX (No eXecution) in CPUs, and DEP (Data Execution Prevention) in Windows. To put it in layman’s terms, what all these technologies do is more or less the same: they make sure that once a memory page is loaded with executable code it cannot be modified, and vice versa, that a page that can be modified cannot be executed. This is, like most of the features of Gentoo Hardened, a mitigation strategy, which limits the effects of buffer overflows in software.

For NX to be useful, you need to make sure that all the executable memory pages are loaded and set in stone right away; this makes text relocations impossible (since they consist of editing the executable pages to change the absolute addresses), and also hinders some other techniques, such as Just-In-Time (JIT) optimisation, where executable code is created at runtime from a higher, more abstract language (both Java and Mono use this technique), and C nested functions (or at least the current GCC implementation, which makes use of trampolines and thus requires an executable stack).
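
To show what I mean about nested functions — a small, GCC-only sketch: the moment you take the address of the nested function, GCC has to materialise a trampoline on the stack, and that stack has to be executable, which is exactly what NX forbids:

#include <stdio.h>

static void apply(int (*fn)(int), int x) { printf("%d\n", fn(x)); }

int main(void) {
    int offset = 42;
    /* GCC extension: a function defined inside another function,
     * capturing the enclosing frame's variables. */
    int add_offset(int v) { return v + offset; }
    apply(add_offset, 1); /* taking its address creates the trampoline */
    return 0;
}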

Does any of this mean that you need PIC-compiled executables (which is what PIE is) to make use of PaX/NX? Not at all. In Linux, by default, all executables are loaded at the same base address, so once the code is built, it doesn’t have to be relocated at all. This also helps optimise the code for the base case where no shared object is involved, as that doesn’t have to deal with PIC-related problems at all (see this old post for more detailed information about the issue).

But in the previous paragraph I did drop a clue as to what the PIE technique is all about: as I said, the reason why PIE is not necessary is that by default all executables are loaded at the same address; but if they weren’t, then they’d need either text relocations or PIC (PIE), wouldn’t they? That’s indeed the reason why PIE exists. Now, the next question would be: how does PIE relate to hardened? Why does the hardened toolchain use PIE? Does using it make it magically possible to have a hardened system?

Once again, no, it’s not that easy. PIE is not, by itself, either a security measure or a mitigation strategy. It is, instead, a requirement for the combined use of two mitigation strategies: the first is the above-described NX idea (which rules out the use of text relocations entirely), while the second is ASLR (Address Space Layout Randomization). To put this technique in layman’s terms as well, you should consider that a lot of exploits require changing the address a variable points to, so you need to know both the address of that variable and the address to point it to; to find this stuff out, you can usually try and try again until you hit the magic values, but if you randomise the addresses where code and data are loaded each time, you make it much harder for the attacker to guess them.

I’m pretty sure somebody here is already ready to comment that ASLR is not a 100% safe security measure, and that’s absolutely right. Indeed, here we have to make some notes as to the situations where this really works out decently: local command exploits. When attacking a server, you’re already left guessing the addresses (since you don’t know which of many possible variants of the same executable the server is using; two Gentoo servers rarely have the same executable either, since they are rebuilt on a case-by-case basis — and sometimes, even with the same exact settings, a different build time might cause different addresses to be used); and at the same time, ASLR only changes the addresses between two executions of the same program: unless the server uses spawned (not cloned!) processes, like inetd does (or rather did), the address space between two requests on the same server will be just the same (as long as the server doesn’t get restarted).

At any rate, when using ASLR, executables are no longer all loaded at the same address, so you either have to relocate the text (which is denied by NX) or you have to use PIE, to make sure that the addresses are all relative to the specified base address. Of course, this also means that, at that point, all the code is going to be PIC, losing a register and thus getting slower (a very good reason to use x86-64 instead of x86, even on systems with less than 4GiB of RAM).

Bottom line of the explanation: using the PIE component of the hardened toolchain is only useful when you have ASLR enabled, as that’s the reason why the whole hardened profile uses PIE. Without ASLR, you will get no benefit from using PIE, but you’ll have quite a few drawbacks (especially on the old x86 architecture) due to building everything PIC. And this is also the reason why software that enables PIE by itself (even conditionally), like KDE 3, is doing silly stuff for most user systems.

And to make it even clearer: if you’re not using hardened-sources as your kernel, PIE will not be useful. This goes for vanilla, gentoo, xen and vserver sources all the same. (I’m sincerely not sure how this behaves when using Linux containers and hardened-sources.)

If you liked this explanation, which cost me some three days’ worth of time to write, I’m happy to receive appreciation tokens — yes, this is a shameless plug, but it’s also a reminder that stuff like this is the reason why I don’t write structured documentation and stick to simple, short and to-the-point blogs.

Again PIC, and executables this time

At the end of November I read an interesting in-depth look at what PIC code is on Planet Debian; although it wasn’t really perfect in my opinion (as I commented on that post, in particular it seems to give the impression that x86’s text-relocation option is better than AMD64’s PC-relative approach, which in my opinion is not the case at all), it is a good explanation for most power users still unsure of what the problem is.

Now, as you can see in that post, the problem with PIC and text relocations is usually related to just shared libraries, since executables always get loaded at the same address in memory and thus require no relocation of their symbols. While this is the case for most common scenarios, there are cases when you want to relocate executables too, for instance if you enable randomisation of the load address as a security mitigation strategy. When that is the case, we say that PIE is enabled: Position Independent Executables.

In PIE-enabled systems, the address used to load an executable can vary between executions, which means that the references to jumps and data will no longer be at a fixed address in memory, just like in a shared library. Again, you have the option of using text relocations (but since they conflict with other security mitigation strategies, it’s quite silly to do that), or you just build PIC code for the executable too. Now, this does cause a hit in performance, since the symbols need to be resolved once again, but it’s still better for memory than text relocations.

It’s for this reason that I try my best to make sure that even code that is not going to be built into shared libraries is optimised so that it does not cause relocations either if possible.

Now, I’m sure most users don’t need or want a PIE-enabled setup on their desktops; while how useful mitigation strategies are on a desktop is a debatable topic that can go on and on for months, I just accept the fact that most users don’t want that, and that they should thus not be forced to use PIE for their software if they don’t want to. Indeed, it can cause extra dirty RSS pages for the processes using those executables, which is not something you wish for, especially on embedded systems.

On Gentoo, we don’t enable PIE by default, which means one would expect all the binaries installed by Portage not to be built with PIC code, but this does not seem to be the case, at least not always. While I was working on my size(1) replacement last night, I found that there are binaries in /usr/bin of my system with .data.rel.ro sections. The point is that .data.rel.ro only exists when the software is built with PIC; otherwise it makes no sense, and the constants are emitted in .rodata, so I was not expecting any .data.rel.ro in /usr/bin.
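
If you want to check your own system, one way to spot this by hand is with readelf from binutils (the binary name is a placeholder, of course):

readelf -WS /usr/bin/some-binary | grep -F .data.rel.ro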

So I started looking at the issue; while none of the files are Position Independent Executables, some of them are built with the -fPIC compiler flag — mostly by mistake, when it was added to properly build shared objects. Add to that the other two possible causes: --with-pic passed to ./configure rather than leaving it to be figured out by libtool, and -fPIC being used to compile static archives, so that they can also be linked into shared libraries.

One of the packages doing this was Avahi, for which I sent a patch to Lennart, but it’s far from being the only one; I’m going to work on a few more patches in the next weeks, and hopefully I’ll be able to reduce that. I also will have to work on an extra check for my bashrc system for the tinderbox, and consider adding that to Portage. Unfortunately, just checking for .data.rel.ro is not enough: if a binary built with -fPIC has no constants, but only variables, they will all be merged into .data, with the proper relocation entries. So to decide whether an ELF file is PIC or not, one has to check the relocations too, and even that is not going to be an absolute certainty, I’m afraid.

At any rate, I’m going to do my usual best to find out how to mitigate these problems, at least for the most common packages. As usual, I strive to ensure that free software gets the best coverage even for these little things that are only important when seen in the grand scheme of things. I guess I’ll have to find time to fetch a copy of Linkers and Loaders and read it, so I can design some more interesting optimisation tests. Maybe one day I’ll be able to find enough time to start adding some of my common tests to binutils itself.

Should only libraries be concerned about PIC?

Position Independent Code is a technique used to create executable code that, as the name implies, is independent of the starting address where it is loaded (the position). This means that the pointers to data and functions in the code, as well as the default values of pointers, cannot be assumed to always be the same as the ones set at build time in the executable file (or the library).

What this means in practical terms is that, as you can’t be sure how many and which libraries a program might load at runtime, libraries are usually loaded at dynamically-assigned addresses, so the code must not statically assume one base address. When a shared library is loaded with a static base address (thus not using PIC), it has to be relocated by the runtime loader, and that causes changes to the .text section, which breaks the assumption that sections should be either writable or executable, but not both at the same time.

When using PIC, instead, the access to symbols (data and functions) goes through a global offset table (GOT), so the code does not need to be relocated, only the GOT and the pointers stored in the data sections. As you can guess, this kind of indirect access takes more time than the direct access that non-PIC code uses, and this is why a lot of people hate the use of PIC on x86 systems (on the other hand, shared libraries not using PIC not only break the security assumption noted above, making it impossible to use mitigation technologies like NX – PaX in Linux, W^X in OpenBSD – they also increase the memory usage of software, as all the .text sections containing code will need to be relocated and thus duplicated by copy-on-write).

Using hidden visibility it is possible to reduce the performance hit caused by GOT access, by using PC-relative addressing instead, if the architecture supports it of course. It does not save much for what concerns pointers in the data sections, as they will still need relocations. This is what causes arrays of strings to be written in .data.rel.ro sections rather than .rodata sections: the former gets relocated, the latter doesn’t, and so is always shared.
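
A minimal sketch of the first point, using the GCC/clang attribute syntax: a hidden symbol cannot be interposed from outside the object it lives in, so the compiler is free to address it PC-relative instead of going through the GOT.

/* Not exported from the shared object. */
__attribute__((visibility("hidden"))) int internal_counter;

int bump(void) {
    /* Direct, PC-relative access: no GOT indirection needed, because
     * nobody outside this object can override the symbol. */
    return ++internal_counter;
}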

So this covers shared libraries, right? Copy-on-write in shared libraries is bad, shared libraries use PIC, and pointers in the data sections of PIC code cause copy-on-write. But does it stop with shared libraries?

One often useful mitigation technique is randomising the base address of executables: instead of loading the code always at the same address, it is randomised between different executions. This is useful because an attacker can’t just start guessing at which address the program will be loaded. But applying this technique to non-PIC code will cause relocations in the .text section, which in turn will break another security mitigation technique, so it is not really a good idea.

Introducing PIE (Position Independent Executable).

PIE is not really anything new: it only means that even executables are built with PIC enabled. This means that while arrays of pointers to characters are often considered fine for executables, and are written to .rodata (if properly declared) in non-PIC code, the problem with them reappears when using PIE.

It’s not much of a concern usually, because the people using PIE are usually concerned with security more than performance (after all, it is slower than non-PIC code), but I think it’s important for software correctness to actually start considering it an issue too.

In this light, not only is it important to replace pointers to characters with arrays of characters (and similar), but hiding the symbols of executables becomes even more important to reduce the hit caused by PIC.
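
To recap the replacement I’m talking about with a minimal sketch (a generic example, not from any particular package):

/* Array of pointers: under PIC/PIE each entry needs a relocation,
 * so the table ends up in .data.rel.ro. */
const char *const msgs_ptr[] = { "foo", "bar", "baz" };

/* Array of arrays of characters: no pointers, no relocations; the
 * whole table stays in .rodata, at the cost of padding each entry. */
const char msgs_arr[][4] = { "foo", "bar", "baz" };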

I’m actually tempted to waste some performance on my next box and start using PIE all over, just to find this kind of problem more easily… yeah, I’m a masochist.

Some more about arrays of strings

If you want to read this entry, make sure you read my previous entry about arrays of strings too, possibly including its comments.

Mart (leio) commented about an alternative to using const char* const for arrays of strings. While it took me a while to get it, he has a point. First let me quote his comment:

If you have larger strings, or especially if you have strings with wildly differing length, then you can also use a long constant string that contains all the strings and a separate array that stores offsets into that array.

For example:

static const char foo[] =
    "Foo\0"
    "Longer string\0"
    "Bar";

static const int foo_index[3] = { 0, 4, 18 };

and then you can just do

#include <stdio.h>
#define N_ELEMENTS(array) (sizeof((array)) / sizeof ((array)[0])) 

int main(void)
{
    int i;
    for(i = 0; i < N_ELEMENTS(foo_index); ++i)
    {
        printf("%s\n", foo + foo_index[i]);
    }
}

There’s perl and other scripts to auto-generate such an array pair from a more readable form. Some low-level GNOME libraries have one, and Ulrich Drepper’s DSO Howto does too (they differ, having different pros and cons).

Thought it might also be useful to someone if dealing with largely differently sized strings :)

Of course if they are similarly sized, then it’s better to use the method described here. Sometimes it even makes sense to split the array up into two parts – one for the small strings that could use this method, and one for the larger, variably-sized ones using this method, or just a different size that fits them all without wasting much.

For a simple example like this, the code he published is actually quite pointless, as the method of using const char* const works just the same way, putting everything in .rodata.

Where this makes sense is in shared objects built with PIC. In those cases, even const char* const doesn’t get to .rodata, but goes into .data.rel.ro. This is due to the way that type of array is implemented: as we’ve seen for the const char* case, the strings are saved in the proper .rodata section, but the pointers in the array are saved to .data as non-constant.

When we’re using PIC, the address at which a section is loaded is not known at compile time; it’s the ELF loader (or whatever loader is used for the current executable format) that has to fill in the addresses of the variables. Which means that while for the purposes of the C language the pointers are constant, they are filled in at runtime, and thus cannot reside on the read-only shared pages. They can’t be shared also because two processes might load the same page at different addresses — which is how PIE ends up being useful together with address randomisation in hardened setups.

When using PIC, the code emitted would be:

        .section        .rodata
.LC0:
        .string "bar"
.LC1:
        .string "foobar"
        .section        .data.rel.ro.local,"aw",@progbits
        .align 16
        .type   foo, @object
        .size   foo, 16
foo:
        .quad   .LC0
        .quad   .LC1

What this method tries to address is the constant 4KiB dirty RSS page that every process loading a given library built with PIC would have, just to keep the .data.rel.ro information. So Mart’s method uses a bit more processing time (an extra load) compared to the array of arrays of characters, coming more or less to the same performance as const char *const, and allowing for variably-sized strings without padding, but without wasting a 4KiB memory page — trading that for readability.

It’s not entirely a bad idea, actually, and I should consider it more for xine-lib, although I doubt I can save the 4KiB page for all of them, as having a structure to pass information like IDs, descriptions and other stuff like that ends up propping up more data in .data.rel.ro anyway.

On the other hand, .data.rel.ro is a COW page, but the COW could be avoided by using prelink: prelinking gives a suitable default address to a shared object, which in turn should fill the .data.rel.ro sections with the right values already. Of course, prelink does not work with PIE and randomised addresses.

I think we should try to make use of arrays of arrays of characters whenever possible anyway ;) Faster and easier.