That innocent warning… or maybe not?

Anybody who has ever programmed in C with a half-decent compiler knows that warnings are very important and you should definitely not leave them be. Of course, some warnings matter more than others, and the more the compiler’s understanding of the code increases, the more warnings it can give you (which is why using -Werror in released code is a bad idea, and why every new compiler release causes so many headaches for me and the other developers).

But sometimes warnings, while not highlighting broken code, indicate more trivial issues, such as suboptimal or wasteful code. One of these warnings was introduced in GCC 4.6.0; it relates to variables that are declared, set, but never read, and I dreamed of it back in 2008.

Now, the warning as it stands is pretty useful. Even though a number of times it will just be used to mask unused-result warnings, it can point out code where a semi-pure function (one without visible side effects, but not marked, or markable, as such because of caching and other extraneous accesses; more about that in my old article if you wish) is invoked just to set a variable that is never used. Especially with very complex functions, a lot of time might be spent processing for nothing.

Let me clarify this situation. If you have a function that silently reads data from a configuration file or a cache to give you a result (based on its parameters), you have a function that, strictly speaking, is non-pure. But if the end result depends primarily on the parameters, and not on the external data, you could say that the function’s interface is pure, from the caller’s perspective.
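
To make it concrete, here’s a minimal sketch of such an interface-pure function; the names are made up, and compute() is a stand-in for the expensive work:

/* hypothetical: compute() stands in for table generation, file parsing… */
static int compute(unsigned char x) {
  return x * x;
}

static int cache[256];
static unsigned char cache_valid[256];

int get_value(unsigned char x) {
  if (!cache_valid[x]) {
    cache[x] = compute(x); /* hidden side effect: fills the cache */
    cache_valid[x] = 1;
  }
  return cache[x]; /* the result only depends on the parameter */
}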

Take localtime() as an example: it is not a strictly-pure function because it calls tzset(), which – as the name suggests – is going to set some global variables, responsible for identifying the current timezone. While these are most definitely side effects, they are not the kind of side effects that you care about: if the initialization doesn’t happen there, it will happen the next time the function is called.

This is not the most interesting case though: tzset() is not a very expensive function, and quite likely it will be called (or would have been called) at some other point in the process. But there are a number of other functions, usually related to encoding or cryptography, which rely on pre-calculated tables that might actually be calculated at the time of use (why that matters is another story).

Now, even considering this, a variable set but never used is usually not troublesome by itself: if it’s used to mask a warning, for instance, you still want the side effect to apply, and you won’t be paying the price of the extra store, since the compiler will not emit the variable at all… as long as said variable is an automatic one, allocated on the function’s stack. Automatic variables undergo the SSA transformation, which among other things allows unused stores to be omitted from the generated code.
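
For contrast, take this code with an automatic variable: with optimisations enabled, the compiler drops the dead store entirely, and no memory is touched at all:

int main() {
  unsigned char done = 0; /* automatic: lives on the stack, SSA applies */

  done = 1; /* dead store: optimised away, no memory traffic */
  return 1;
}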

Unfortunately, SSA cannot be applied to static variables, which means that assigning a static variable, even though said variable is never read, will cause the compiler to include a store of that value in the final code. This is indeed what happens, for instance, with the following code (tested with both GCC 4.5 – which does not warn – and 4.6):

int main() {
  static unsigned char done = 0;

  done = 1;
  return 1;
}

The addition of the -Wunused-but-set-variable warning in GCC 4.6 is thus a godsend to identify these, and can actually lead to improvements in the performance of the code itself — although I wonder why GCC still emits the static variable in this case, since 4.6, at least, knows enough to warn you about it. I guess this is a matter of a missed optimization, nothing world-shattering. What I was much more surprised by is that GCC fails to warn you about one similar situation:

static unsigned char done = 0;

int main() {

  done = 1;
  return 1;
}

In the second snippet above, the variable has been moved from function scope to unit scope, and this is enough to confuse GCC into not warning you about it. Obviously, to catch this situation the compiler has to perform more work than in the previous case, since the variable could be accessed by multiple functions, but at least with -funit-at-a-time it is already able to apply similar analysis, since it reports unused static functions and constants/variables. I reported this as bug #48779 upstream.

Why am I bothering to write a whole blog post about a simple missed warning and optimization? Well, while it is true that zeroed static variables don’t cause much trouble, since they are mapped to the zero page and shared by default, you can cause a huge waste of memory if you have a write-only variable that also needs relocation, like in the following code:

#include <stddef.h>

static char *messages[] = {
  NULL, /* set at runtime */
  "Foo",
  "Bar",
  "Baz",
  "You're not reading this, are you?"
};

int main(int argc, char *argv[]) {
  messages[0] = argv[0];

  return 1;
}

Note: I made this code an executable just because it was easier to write down; you should think of it as a PIE so that you can see the issue with relocation.

In this case, the messages variable is going to be emitted even though it is never used — by the way, it is not emitted if you don’t touch it at all: when a static variable is reported as unused, the compiler also drops it; not so for the set-but-never-read ones, as I said above. Luckily I can usually identify problems like these by running cowstats (part of my Ruby-Elf utilities, if you wish to try it) and then looking at the code that uses the variable, but you can guess it would have been nicer to have this in the compiler already.

I guess we’ll have to wait for 4.7 to have that. Sigh!

Maybe not-too-neat Apache tricks; Part 1

In general, I’m not an expert sysadmin; my work usually involves development rather than administration, but like many other distribution developers, I had to learn system administration to make sure that the packages do work on the users’ systems. This gets even messier when you deal with Gentoo and its almost infinite number of combinations.

At any rate, I end up administering not only my local systems, but also two servers (thanks to IOS Solutions, who provide xine with its own server for the site and bugzilla). I started out with lighttpd, but through a long series of circumstances I ended up moving to Apache (mostly content negotiation things). I had to learn the hard way about a number of issues — luckily security was never involved.

My setup moved from a single lighttpd instance, to one Apache running two static websites, a Bugzilla and a Typo instance, to two Apache instances on two servers: one running a static website and a Bugzilla instance, the other running a few static websites and a Typo instance via Passenger. The latter is more or less what I have now.

On one side, Midas is keeping up the xine website (static, generated via XSLT after commit and push); on the other, Vanguard – the one I pay for – keeps this blog, my website and a few more running. I used to have a gitweb instance (and Gitarella before that), but I stopped providing the git repositories myself; it’s much easier to push them to Gitorious or GitHub as needed.

The static websites use my own generator, for which I still have to find a proper license. Most of these sites are mine or belong to friends of mine, but with things changing a bit for me, I’m going to start offering that as a service package for my paying customers (you have no idea how many customers would be interested in just having a simple, static page updated once every few months… as long as it looked cool).

But since I have, from time to time, to stop Apache to make changes to my blog – and in the past Passenger went crazy and Apache stopped answering requests at all – I’m not very convinced about running the two alongside each other for much longer. I’ve decided it is a good idea to start figuring out an alternative approach; the solution I’m thinking of requires two Apache instances on the same machine. Since I cannot use different ports for them (well, I could run my blog over 443/SSL, but I don’t think that would be that good of an idea for the read-only situation), I’ve now requested a second IP address (the current vserver solution I’m renting should support up to 4), and I’ll run the two instances over the two different IP addresses.

Now, one of the nice things of splitting the two instances this way is that I don’t even need ModSecurity on the instance that only serves the static sites; while they are not really as static as a stone (I make use of content negotiation to support multiple languages on the same site, and mod_rewrite to set the forcing), there is no way that I can think of for a security issue to be triggered while serving them. I could even use something different from Apache to serve them, but the few advanced features I make use of don’t make it easy to switch (content negotiation is one; another is RewriteMaps, used to recover moved/broken URLs). And obviously, I wouldn’t need Passenger either.

But all the other modules? Well, those I’d need; and since by default they are built as shared modules (I ranted about that last November), loading two copies of them means duplicating .data.rel and the other Copy-on-Write sections. Not nice. So I finally bit the bullet and, knowing that Apache upstream allows building them in, I set out to find whether the Gentoo packaging allows for that situation. Indeed it does, but the mishandling of the static USE flag made it quite a bit harder to find out. After enabling that one, disabling the mem_cache, file_cache and cache modules (which are not loaded by default but are still built, and would be built in when using the static USE flag), and restarting Apache, the process map looked much better: the apache2 processes now have far fewer files open (and thus a much neater memory map).

One thing that is interesting to note: right now, I’ve not been using mod_perl for Bugzilla because of the configuration trouble; one day I might actually try that, possibly with a second Apache instance on Midas, open only over SSL, with a CACert certificate.

Now it might very well be that you need a particular module only in one case, such as mod_ssl to run a separate process for an SSL-enabled Apache 2 instance… In that case, one possible solution, even though not an extremely nice one, is to use the EXTRA_ECONF trick that I have already described. In this case, you could create a /etc/portage/env/www-servers/apache file with this content:

export EXTRA_ECONF="${EXTRA_ECONF} --enable-ssl=shared"

On a separate note, I think one of the reasons why our developers let the default be dynamic modules is related to the psychology of calling them “shared”. It makes it sound like you’re wasting memory when multiple processes use a “non-shared” module… when in reality you’re creating many more private memory mappings with the shared version. Oh well.

Unfortunately, as it happens, the init system we have in place does not allow for more than one Apache instance to be running; it really requires different configuration files and probably a new init script, so I’ll have to come back to this stuff in the next days, for the remaining parts.

There are, though, three almost completely unrelated notes that I want to sneak in:

  • I’m considering a USE=minimal (or an inverse, default-enabled, USE=noisy) for pambase; it would basically disable modules such as pam_mail (tells you if you have unread mail in your queue — only useful if you have a local MDA), pam_motd (gives you the Message of the Day of the system) and pam_tally/pam_lastlog (keeping track of login/su requests). The reason is that these modules are kept loaded in memory by, among others, sshd sessions, and I can’t find any use for them on most desktop systems, or single-user managed servers (I definitely don’t leave a motd for myself).
  • While I know that Nathan complained to me about it, I think I’m starting to understand why the majority of websites seem to stick with www or some other third-level domain: almost no DNS service seems to actually allow a CNAME on the origin record (that is, the second-level domain); this means that you end up with the two-level domain pointing directly to an IP, and changing a lot of those is not a fun task if you’re switching the hosting from one server to another.
  • CACert and Google Chrome/Chromium don’t seem to get along at all. Not only have I been unable to tell it to accept the CACert root certificate, but while trying to generate a new client certificate with it, the page froze solid. And if I try to install one after generating it with Firefox, well… it errors out entirely.

Shared libraries worth their while

This is, strictly speaking, a non-Gentoo-related post; on the other hand, I’m going to introduce a few concepts here that I’ll use in a future post to explain one Gentoo-specific warning, so I’ll consider this a prerequisite. Sorry if you feel like Planet Gentoo should never go over technical non-Gentoo work, but then again, you’re free to ignore me.

I have, in the past, written about the need to handle shared code in packages that install multiple binaries (real binaries, not scripts!) which perform various tasks but end up sharing most of their code. Doing the naïve thing, compiling the source code into all of them, or the slightly less naïve thing, building a static library and linking it to all the binaries, tends to increase the size of the commands on disk, and the memory required to fully load them. In my previous post I noted a particularly nasty problem with the smbpasswd binary, which was almost twice the size it needed to be because of unused code pulled in by the static convenience library (and probably even more, given that I never went as far as hiding the symbols and cleaning them up).

In another post, I also proposed the use of multicall binaries to handle these situations; the idea behind multicall binaries is that you end up with a single program with multiple “applets”: all the code is merged into a single ELF binary object, and at runtime the correct path is taken to call the right applet, depending on the name used to invoke the binary. It’s not extremely easy, but not impossible either, to get right, so I still suggest it as the main alternative for handling shared code, when the shared code is bigger in size than the single applet’s code.
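
A minimal sketch of the idea, with made-up applet names (the real thing would obviously share much more code between the applets):

#include <stdio.h>
#include <string.h>

static int foo_main(int argc, char *argv[]) {
  (void)argc; (void)argv;
  puts("foo applet");
  return 0;
}

static int bar_main(int argc, char *argv[]) {
  (void)argc; (void)argv;
  puts("bar applet");
  return 0;
}

int main(int argc, char *argv[]) {
  /* dispatch on the basename the program was invoked with */
  const char *slash = strrchr(argv[0], '/');
  const char *name = slash ? slash + 1 : argv[0];

  if (strcmp(name, "foo") == 0)
    return foo_main(argc, argv);
  if (strcmp(name, "bar") == 0)
    return bar_main(argc, argv);

  fprintf(stderr, "%s: unknown applet\n", name);
  return 1;
}

You’d then install the binary once and create hard links (or symlinks) named foo and bar pointing to it, which is how multicall binaries like busybox are normally deployed.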

This does not solve the Samba situation though: the code of the single utilities is still big enough that a single super-binary would not be very manageable, and a different solution has to be devised. In this case you end up having to choose between static linking (the naïve approach) and using a private shared object. An easy way out here is trying to be sophisticated and always going with the shared object approach; but it might well not be the best option.

Let me be clear here: shared objects are not a panacea for shared-code problems. As you might have heard already, using shared objects is generally a compromise: you ease problems related to bugs and security vulnerabilities, since you don’t have to rebuild all the software using that code — and most of the time you also share read-only memory, reducing the memory consumption of the system — at the expense of load time (the loader has to do much more work), sometimes execution speed (PIC takes its toll), and sometimes memory usage, as counter-intuitive as that might sound, given that I just said they reduce memory consumption.

While the load time and execution speed tolls are pretty much immediate to understand, and you can find a lot of documentation about them on the net, the share-memory-versus-waste-memory situation is less obvious. I wrote extensively about the Copy-on-Write problem, so if you follow my blog regularly you might have guessed the issue already at this point, but that does not fill in all the gaps yet, so let me try to explain how this compromise works.

When we use ELF objects, parts of the binary file itself are shared in memory across different processes (homogeneous or heterogeneous). This means that only those parts that are not modified from the ELF file can be shared. This usually includes the executable code – text – for standard executables (and most code compiled with PIC support for shared objects, which is what we’re going to assume here), and part (most) of the read-only data. In all cases, what breaks the sharing for us is Copy-on-Write, as that creates copies of the pages private to the single process, which is why writeable data does not matter when choosing the code-sharing strategy (it’ll mostly be the same whether you link it statically or via shared objects — there are corner cases, but I won’t dig into them right now).

What was that above about homogeneous and heterogeneous processes? Well, it’s a mistake to think that the only memory shared in the system is due to shared objects: read-only text and data of an ELF executable file are shared among processes spawned from the same file (what I called, and will call, homogeneous processes). What shared objects accomplish is sharing memory between processes spawned by different executables that load the same shared objects (heterogeneous processes). The KSM implementation (no, it’s not KMS!) in current versions of the Linux kernel allows for something similar, but that’s a story long enough that I won’t count it in here.

Again, the first approach to shared objects might make you think that moving whatever amount of memory from being shared between homogeneous processes to being shared between heterogeneous processes is a win-win situation. Unfortunately, you have to cope with data relocations (a topic I have written about extensively): a constant pointer is read-only when the code is always loaded at a given address (as happens with most standard ELF executables), but not when the code can be loaded at an arbitrary address (as happens with shared objects): in the latter case it ends up in the relocated data section, which follows the same destiny as the writeable data section: it’s always private to the single process!
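
To make the distinction concrete, here’s a tiny sketch of two spellings of the same constant table:

/* both are constant tables; in a PIC shared object the first needs its
   pointers relocated at load time and lands in .data.rel.ro (private per
   process), while the second has no pointers and stays in .rodata (shared) */
static const char *const ptr_table[] = { "foo", "bar", "baz" };
static const char      arr_table[][4] = { "foo", "bar", "baz" };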

*Note about relocated data: in truth you could ensure that the data relocation is the same among different processes, by using either prelinking (which is not perfect, especially with modern software that is more and more plugin-based), or methods like KDE’s kdeinit preloading. In reality, this is not really something you can, or should, rely upon, because it also goes against the strengthening of security applied by Address Space Layout Randomization.*

So when you move shared code from static linking to shared objects, you have to weigh two factors: how much code will be left untouched by the process, and how much will be relocated? The size utility from either elfutils or binutils will not help you here, as it does not tell you how big the relocated data section is. My Ruby-Elf instead has an rbelf-size script that gives you the size of .data.rel.ro (another point here: you only care about the increment in size of .data.rel.ro, as that’s what is added as private; .data.rel would be part of the writeable data anyway). You can see it in action here:

flame@yamato ruby-elf % ruby -I lib tools/rbelf-size.rb /lib/libc.so.6
     exec      data    rodata     relro       bss     total filename
   960241      4507    359020     12992     19104   1355864 /lib/libc.so.6

As you can see from this output, the C library has some 950K of executable code, 350K of read-only data (both will be shared among heterogeneous processes) and just 13K (tops) of additional relocated memory, compared to static linking. (Note: the rodata value does not only include .rodata but all the read-only non-executable sections; the sum of the exec and rodata values roughly corresponds to what size calls text.)

So how is knowing the amount of relocated data useful in assessing how to deal with shared code? Well, if you build your shared code as a shared object and analyse it with this method (hint: I just implemented rbelf-size -r to reduce the columns to the three types of memory we have in front of us), you’ll have a rough idea of how much gain and how much waste you’ll have in terms of memory: the higher the shared-to-relocated ratio, the better the results. An infinite ratio (when there is no relocated data) is perfection.

Of course the next question is what you do if you have a low ratio. Well, there isn’t really a single correct answer here: you might decide to bite the bullet and go into the code to improve the ratio; cowstats, from the Ruby-Elf suite, helps you do just that. It can actually help you reduce your private sections as well, as many times there are mistakes in there due to missing const declarations. If you have already done your best to reduce the relocations, then your only remaining option is to avoid using a library altogether: if you’re not going to improve your memory usage by using a library, and it’s something internal only, then you really should look into using either static linking or, even better, multicall binaries.

Important Notes of Caution

While I’m trying to go further on the topic of shared objects than most documentation I have read on the argument, I have to point out that I’m still generalising a lot! While the general concepts are as I put them down here, there are some specific situations that change the picture, making it much more complex: text relocations, position-independent executables, and PIC overhead are just some of the problems that might arise while trying to apply these general ideas to specific situations.

Still trying not to dig too deep into the topic right away, I’d like to spend a few words on the PIE problem, which I have already described and noted in this blog: when you use Position Independent Executables (which is usually done to make good use of the ASLR technique), you can discard the whole check on relocated data: almost always you’ll have good results if you use shared objects (minus complications added by overlinking, of course). You would still get the best results by using multicall binaries if the commands have very little code.

Also, please remember that using shared objects slows down the loading process, which means that if you have a number of fire-and-forget commands, which is not too unusual in UNIX-like environments, you will probably get better results with multicall binaries, or static linking, than with shared objects. The shared memory is also something that you’ll probably get to ignore in that case, as it’s only worth its while if you normally keep the processes running for a relatively long time (and thus loaded into memory).

Finally, all I said refers to internal libraries used for sharing code among commands of the same package: while most of the same performance notes, both up- and down-sides, hold true for all kinds of shared objects, you have to factor in the security and stability problems when you deal with third-party (or third-party-available) libraries — even if you develop them yourself and ship them with your package: they’ll still be used by many other projects, so you’ll have to handle them with much more care, and they should really be shared.

Plugins aren’t always a good choice

I’ve been saying this for quite a while; probably the most on-topic post was written a few months ago, but there are some hints at it in posts about xine and others.

I used to be an enthusiast about plugin interfaces; with time, though, I started having more and more doubts about their actual usefulness — it’s a trait I very much like in myself, being fine with reconsidering my own positions over time and deciding that I was wrong; it happened before with other things, like KDE (and C++ in general).

It’s not like I’m totally against the use of plugins altogether. I only think that they are expensive in more ways than one, and that their usefulness is often overstated, or tied to other kinds of artificial limitations. For instance, dividing a software’s features over multiple plugins usually makes it easier for binary distributions to package them: they only have to ship a package with the main body of the software, and many for the plugins (one per plugin might actually be too much, so sometimes they get grouped). This works out pretty well for both the distribution and, usually, the user: the plugins that are not installed will not bring in extra dependencies, won’t take time to load, and won’t use memory for either code or data. It basically gives binary distributions a flexibility comparable to Gentoo’s USE flags (and similar options in almost any other source-based distribution).

But as I said, this comes with costs that might or might not be worth it in general. For instance, Luca wanted to implement plugins for feng similarly to what Apache and lighttpd have. I can understand his point: let’s not load code for stuff we don’t have to deal with, which is more or less the same reason why Apache and lighttpd have modules; in the case of feng, if you don’t care about the access log, why should you be loading the access log support at all? I can give you a couple of reasons:

  • because the complexity of managing a plugin to deal with the access log (or any other similar task) is higher than just having a piece of static code that handles that;
  • because the overhead of having a plugin loaded just to do that is higher than that of having the static code built in and not enabled in the configuration.

The first problem is a result of the way a plugin interface is built: the main body of the software cannot know about its plugins in too specific ways. If the interface is a very generic plugin interface, you add some “hook locations” and then it’s the plugin’s task to find out how to do its magic, not the software’s. There are some exceptions to this rule: if you have a plugin interface for handling protocols, like the KIO interface (and I think gvfs has the same), you get the protocol from the URL and call the correct plugin, but even then you’re leaving it to the plugin to do its magic. You can provide a way for the plugin to tell the main body what it needs and what it can do (like which functions it implements), but even that requires the plugins to be quite autonomous. And that also means being able to take care of allocating and freeing resources as needed.
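
To give an idea, here’s a bare-bones sketch of such a generic interface; all the names here are hypothetical, this is not feng’s (or Apache’s) actual interface:

#include <dlfcn.h>
#include <stdio.h>

/* hypothetical contract: each plugin exports a "plugin_info" symbol */
typedef struct {
  const char *name;
  int (*init)(void);      /* the plugin allocates whatever it needs… */
  void (*shutdown)(void); /* …and is in charge of releasing it */
} plugin_info_t;

int load_plugin(const char *path) {
  void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
  if (handle == NULL) {
    fprintf(stderr, "cannot load %s: %s\n", path, dlerror());
    return -1;
  }

  plugin_info_t *info = dlsym(handle, "plugin_info");
  if (info == NULL || info->init() != 0) {
    dlclose(handle);
    return -1;
  }

  /* from here on, the main body can only go through the hooks in info */
  return 0;
}

(On glibc this needs linking with -ldl.) Note how the loader knows nothing about what the plugin actually does; everything is left to the hooks.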

The second problem is not only tied to the cost of calling the dynamic linker at runtime to load the plugin and its eventual dependencies (which is a non-trivial amount of work, one has to say), but also to the need for code that finds the modules to load, loads them, initialises them, and keeps a list of modules to call at any given interface point; plus two more points: the PIC problem and the problem of less-than-page-sized segments. This last problem is often ignored, but it’s my main reason to dislike plugins when they are not warranted for other reasons. Given a page size of 4KiB (which is the norm on Linux as far as I know), if the code is smaller than that, it will still require a full page (it won’t pack together with the rest of the software’s code areas); at least code is disk-backed (if it’s PIC, of course). It’s worse for variable data, or variable relocated data, since those are not disk-backed, and it’s not rare to be using a whole page for something like 100 bytes of actual variables.

In the case of the access log module that Luca wrote for feng, the statistics are as follows:

flame@yamato feng % size modules/.libs/mod_accesslog.so
   text    data     bss     dec     hex filename
   4792     704      16    5512    1588 modules/.libs/mod_accesslog.so

Which results in two pages (8KiB) for the bss and data segments, neither disk-backed, and two disk-backed pages for the executable code (text): 16KiB of addressable memory for a mapping that does not reach 6KiB. That’s a 10KiB overhead, which is much higher than 50%. And that’s the memory overhead alone. The whole overhead, as you might guess at this point, is usually within 12KiB (since you have three segments, and each can have at most one byte less than the page size as overhead — it’s actually more complex than this, but let’s assume it’s true).

It really doesn’t sound like a huge overhead by itself, but you always have to judge it against the size of the plugin itself. In the case of feng’s access log, it’s a very young plugin that lacks a lot of functionality, so one might say that with time it’ll be worth it… so I’d like to show you the size statistics for the Apache modules on the very server my blog is hosted on. Before doing so, though, I have to point out one big difference: feng is built with most optimisations turned off, while Apache is built optimised for size; they are both AMD64 though, so the comparison is quite easy.

flame@vanguard ~ $ size /usr/lib64/apache2/modules/*.so | sort -n -k 4
   text    data     bss     dec     hex filename
   2529     792      16    3337     d09 /usr/lib64/apache2/modules/mod_authn_default.so
   2960     808      16    3784     ec8 /usr/lib64/apache2/modules/mod_authz_user.so
   3499     856      16    4371    1113 /usr/lib64/apache2/modules/mod_authn_file.so
   3617     912      16    4545    11c1 /usr/lib64/apache2/modules/mod_env.so
   3773     808      24    4605    11fd /usr/lib64/apache2/modules/mod_logio.so
   4035     888      16    4939    134b /usr/lib64/apache2/modules/mod_dir.so
   4161     752      80    4993    1381 /usr/lib64/apache2/modules/mod_unique_id.so
   4136     888      16    5040    13b0 /usr/lib64/apache2/modules/mod_actions.so
   5129     952      24    6105    17d9 /usr/lib64/apache2/modules/mod_authz_host.so
   6589    1056      16    7661    1ded /usr/lib64/apache2/modules/mod_file_cache.so
   6826    1024      16    7866    1eba /usr/lib64/apache2/modules/mod_expires.so
   7367    1040      16    8423    20e7 /usr/lib64/apache2/modules/mod_setenvif.so
   7519    1064      16    8599    2197 /usr/lib64/apache2/modules/mod_speling.so
   8583    1240      16    9839    266f /usr/lib64/apache2/modules/mod_alias.so
  11006    1168      16   12190    2f9e /usr/lib64/apache2/modules/mod_filter.so
  12269    1184      32   13485    34ad /usr/lib64/apache2/modules/mod_headers.so
  12521    1672      24   14217    3789 /usr/lib64/apache2/modules/mod_mime.so
  15935    1312      16   17263    436f /usr/lib64/apache2/modules/mod_deflate.so
  18150    1392     224   19766    4d36 /usr/lib64/apache2/modules/mod_log_config.so
  18358    2040      16   20414    4fbe /usr/lib64/apache2/modules/mod_mime_magic.so
  18996    1544      48   20588    506c /usr/lib64/apache2/modules/mod_cgi.so
  20406    1592      32   22030    560e /usr/lib64/apache2/modules/mod_mem_cache.so
  22593    1504     152   24249    5eb9 /usr/lib64/apache2/modules/mod_auth_digest.so
  26494    1376      16   27886    6cee /usr/lib64/apache2/modules/mod_negotiation.so
  27576    1800      64   29440    7300 /usr/lib64/apache2/modules/mod_cache.so
  54299    2096      80   56475    dc9b /usr/lib64/apache2/modules/mod_rewrite.so
 268867   13152      80  282099   44df3 /usr/lib64/apache2/modules/mod_security2.so
 288868   11520     280  300668   4967c /usr/lib64/apache2/modules/mod_passenger.so

The list is ordered by the size of the whole plugin (summed up, not counting padding); the last three positions are definitely unsurprising, although the sheer size of the two that are not part of Apache itself does surprise me (and I start to wonder whether they link something in statically that I missed). The fact that the rewrite module is likely the most complex plugin in Apache’s own distribution never escaped me.

As you can see, almost all plugins have vast overhead, especially in the bss segment (all of them have at least 16 bytes used, and that warrants a whole page for them: 4080 bytes wasted each); the data segment is also interesting: only the two external ones have more than a page worth of variables (which is also suspicious to me). When all the plugins are loaded (as they most likely are right now on my server), there are at least 100KiB of overhead, just for the sheer fact that these are plugins and thus have their own mappings. It might not sound like a lot of overhead, since Apache requests so much memory already, especially with Passenger running, but it definitely doesn’t sound like a good thing for embedded systems.

Now I have no doubt that a lot of people like the fact that Apache has all of those as plugins, as they can then use the same Apache build across different configurations without risking keeping in memory more code and data than is actually needed, but is that right? While it’s obvious that it would be impossible to drop the plugin interface from Apache (since it’s used by third-party developers, more on that later), I would be glad if it were possible to build in the modules that come with Apache itself (given I can already choose which ones to build or not in Gentoo). Of course I too am using Apache in two configurations, and for instance one does not use the authentication system for anything, while the other is not using CGI; but is the overhead caused by the rest of the modules worth the hassle, given that Apache already has a way not to initialise the unused built-ins?

I wrote “third-party developers” above, but I have to say now that it wasn’t really a proper definition: it’s not just what third parties would do; it might very well be the original developers who want to use plugins to develop separate projects for some (complex) features, with different release handling altogether. For uses like that, the cost of plugins is often justifiable; and I am definitely not against having a plugin interface in feng. My main beef is when plugins are created for functions that are part of the basic featureset of a software.

Another, unfortunately not uncommon, problem with plugins is that the interface might be skewed by bad design, as was (and is) the case for xine: when trying to open a file, it has to pass through all the plugins, so it loads all of them into memory, together with the libraries they depend on, to ask each of them to test the current file; since plugins cannot really be properly unloaded (and it’s not just a xine limitation), the memory will still be used, the libraries will still be mapped into memory (and relocated, causing copy-on-write, and thus more memory), and at least half the point of using plugins has gone away (the ability to only load the code that is actually going to be used). Of course you’re left with the chance that an ABI break does not kill the whole program, just the plugin, but that’s a very small advantage, given the cost involved in handling plugins. And the way xine was designed, it was definitely impossible to have third-party plugins developed properly.

And to finish off: I said before that plugins cannot be cleanly unloaded. The problem is not only that it’s difficult to have proper cleanup functions for the plugins themselves (since often the allocated resources are stored within state variables), but also that some libraries (used as dependencies) have no cleanup at all, and rely (erroneously) on the fact that they won’t be unloaded. And even when they know they could be unloaded, the PulseAudio libraries, for instance, have to remain loaded because there is no proper way to clean up Thread-Local Storage variables (and a re-load would be quite a problem). Which takes away another point of using plugins.

I leave the rest to you.

The PIE is not exactly a lie…

Update (2017-09-10): The bottom line of this article has changed in the eight years since it was posted, quite unsurprisingly. Nowadays, the vanilla kernel has decent ASLR, and so everyone actually benefits from building everything as PIE. Indeed, Arch Linux and probably most other binary distributions do exactly that. The rest of the technical description of why this is important and how is still perfectly valid.

One very interesting misconception related to Gentoo, and especially the hardened sub-profile, relates to PIE (Position-Independent Executable) support. This is probably due to the fact that up to now the hardened profile has always included PIE support, and since it relates directly to PIC (Position-Independent Code), and PIC as well is tied back to hardened support, people tend to confuse which technique is used for which scope.

Let’s start with remembering that PIC is a compilation option that produces the so-called relocatable code; that is, code that is valid no matter what base address it is loaded at. This is a particularly important feature for shared objects: to be able to be loaded by any executable and still share the code pages in memory, the code needs to be relocatable; if it’s not, a text relocation has to happen.

Relocating the “text” means changing the executable code segment so that the absolute addresses (of both functions and data — variables and constants) are correct for the base address the segment was loaded at. Doing this causes a Copy-on-Write for the executable area, which, among other things, wastes memory (each process running will have its own private copy of the executable memory area, as well as of the variable data memory area). This is the reason why shared objects in almost any modern distribution are built relocatable: faster load time and reduced memory consumption, at the cost of sacrificing a register.

An important note here: sacrificing a register, which is something needed for PIC to keep the base address of the loaded segment, is a minuscule loss for most architectures, with the notable exception of x86, where there are very few general registers to use. This means that while PIC code is slightly (but not notably) slower for any other architecture, it is a particularly heavy hit on x86, especially for register-hungry code like multimedia libraries. For this reason, shared objects on x86 might still be built without PIC enabled, at the cost of load time and memory, while for most other architectures, the linker will refuse to produce a shared object if the object files are not built with PIC.

Up to now, I said nothing about hardened at all, so let me introduce the first relation between hardened and PIC: it’s called PaX in Hardened Linux, but the same concept is called W^X (Write xor eXecute) in OpenBSD – which is probably a very descriptive name for a programmer – NX (No eXecution) in CPUs, and DEP (Data Execution Prevention) in Windows. To put it in layman terms, what all these technologies do is more or less the same: they make sure that once a memory page is loaded with executable code, it cannot be modified, and vice-versa that a page that can be modified cannot be executed. This is, like most of the features of Gentoo Hardened, a mitigation strategy, that limits the effects of buffer overflows in software.

For NX to be useful, you need to make sure that all the executable memory pages are loaded and set in stone right away; this makes text relocations impossible (since they consist of editing the executable pages to change the absolute addresses), and also hinders some other techniques, such as Just-In-Time (JIT) compilation, where executable code is created at runtime from a higher, more abstract language (both Java and Mono use this technique), and C nested functions (or at least the current GCC implementation, which makes use of trampolines and thus requires an executable stack).
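
For the curious, this is the kind of code that triggers the trampoline case (nested functions are a GCC extension, not standard C); it’s taking the nested function’s address that forces the trampoline:

/* taking the address of a nested function makes GCC build a trampoline
   on the stack, and that stack page must then be executable */
static void foreach(const int *v, int n, void (*fn)(int)) {
  for (int i = 0; i < n; i++)
    fn(v[i]);
}

int sum(const int *v, int n) {
  int total = 0;
  void add(int x) { total += x; } /* nested function, GCC extension */
  foreach(v, n, add);             /* address taken: trampoline needed */
  return total;
}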

Does any of this mean that you need PIC-compiled executables (which is what PIE is) to make use of PaX/NX? Not at all. In Linux, by default, all executables are loaded at the same base address, so once the code is built, it doesn’t have to be relocated at all. This also helps optimising the code for the base case of no shared object used, as that’s not going to have to deal with PIC-related problems at all (see this old post for more detailed information about the issue).

But in the previous paragraph I did drop a clue as to what the PIE technique is all about; as I said, the reason why PIE is not necessary is that by default all executables are loaded at the same address; but if they weren’t, they’d need either text relocations or PIC (PIE), wouldn’t they? That’s indeed the reason why PIE exists. Now, the next question would be: how does PIE relate to hardened? Why does the hardened toolchain use PIE? Does using it make it magically possible to have a hardened system?

Once again, no, it’s not that easy. PIE is not, by itself, either a security measure or a mitigation strategy. It is, instead, a requirement for the combined use of two mitigation strategies: the first is the above-described NX idea (which rules out the idea of using text relocations entirely), while the second is ASLR (Address Space Layout Randomization). To put this technique in layman’s terms as well, consider that a lot of exploits require changing the address a variable points to, so you need to know both the address of that variable and the address to point it to; to find this stuff out, you can usually try and try again until you find the magic values, but if you randomize the addresses where code and data are loaded each time, you make it much harder for the attacker to guess them.
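
You can see the randomization at work with a trivial test program, built once as a standard executable and once as a PIE (gcc -fPIE -pie), on a kernel that performs ASLR:

#include <stdio.h>

int main(void) {
  /* with a PIE on an ASLR-enabled kernel this address changes between
     runs; with a standard executable it is fixed at the default base */
  printf("main is at %p\n", (void *)&main);
  return 0;
}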

I’m pretty sure somebody here is already ready to comment that ASLR is not a 100% safe security measure, and that’s absolutely right. Indeed, here we have to make some notes as to which situation this really works out decently in: local command exploits. When attacking a server, you’re already left guessing the addresses (since you don’t know which of many possible variants of the same executable the server is using; two Gentoo servers rarely have the same executable either, since they are rebuilt on a case-by-case basis — and even with the same exact settings, a different build time might cause different addresses to be used); and at the same time, ASLR only changes the addresses between two executions of the same program: unless the server uses spawned (not cloned!) processes, like inetd does (or rather did), the address space between two requests on the same server will be just the same (as long as the server doesn’t get restarted).

At any rate, when using ASLR, the executables are no longer loaded all at the same address, so you either have to relocate the text (which is denied by NX) or you’ve got to use PIE, to make sure that the addresses are all relative to the specified base address. Of course, this also means that, at that point, all the code is going to be PIC, losing a register, and thus slowed down (a very good reason to use x86-64 instead of x86, even on systems with less than 4GiB of RAM).

Bottom line of the explanation: using the PIE component of the hardened toolchain is only useful when you have ASLR enabled, as that’s the reason why the whole hardened profile uses PIE. Without ASLR, you will have no benefit in using PIE, but you’ll have quite a few drawbacks (especially on the old x86 architecture) due to building everything PIC. And this is also the same reason why software that enables PIE by itself (even conditionally), like KDE 3, is doing silly stuff for most user systems.

And to make it even more clear: if you’re not using hardened-sources as your kernel, PIE will not be useful. This goes for vanilla, gentoo, xen, and vserver sources all the same. (I’m sincerely not sure how this behaves when using Linux containers with hardened sources.)

If you liked this explanation, which cost me some three days’ worth of time to write, I’m happy to receive appreciation tokens in the form of page likes — yes, this is a shameless plug, but it’s also to remind you that stuff like this is the reason why I don’t write structured documentation and stick to simple, short and to-the-point blogs.

For elves, size matters

This weekend I was supposed to take it slow, relax, and spend it rebuilding my strength, playing games and stuff. In the end I wasn’t able to do much of that, and instead I spent it looking at projects I had left dangling for a while, like my very own Ruby-Elf.

One beef I have always had with ELF analysis tools is with the size command, which is used to get some quick statistics about the internal size distribution of an ELF file. I was interested in it because it’s tightly related to what cowstats already tells you. Indeed, if you look at the following output, the data column is the total size of the Copy-on-Write sections of the file.

% size /usr/lib/libxml2.so              
   text	   data	    bss	    dec	    hex	filename
1425219	  35772	   5368	1466359	 165ff7	/usr/lib/libxml2.so

You’d expect me to be happy that there is already a tool that extracts that information out of compiled ELF files, but I really am not, because it does not actually provide enough granularity for my usual targets.

First of all, you have to know that the three column names are not the actual names of the involved sections, even though .text, .data and .bss are three common ELF sections. The actual meanings are: for bss, the sum of the sizes of all the allocated sections which are mapped to the zero page (like .bss and .tbss); for data, the sum of the sizes of all the allocated writeable sections; and for text, the sum of the sizes of all the remaining allocated sections. This means that .rodata sections are counted in text rather than data, which can easily be confusing.

I blame this confusion also on the size(1) man page that comes from binutils, since it contains this:

Here is an example of the Berkeley (default) format of output from size:

$ size --format=Berkeley ranlib size
   text    data     bss     dec     hex filename
 294880   81920   11592  388392   5ed28 ranlib
 294880   81920   11888  388688   5ee50 size

This is the same data, but displayed closer to System V conventions:

$ size --format=SysV ranlib size
ranlib  :
section         size      addr
.text         294880      8192
.data          81920    303104
.bss           11592    385024
Total         388392

size  :
section         size      addr
.text         294880      8192
.data          81920    303104
.bss           11888    385024
Total         388688

This would make you expect that the three columns map directly to the three sections with those names. Instead, if you actually use the System V format on libxml2, you get a much different result:

% size --format=SysV /usr/lib/libxml2.so
/usr/lib/libxml2.so  :
section             size      addr
.gnu.hash          13012       456
.dynsym            43464     13472
.dynstr            35235     56936
.gnu.version        3622     92172
.gnu.version_r       128     95800
.rela.dyn          76872     95928
.rela.plt           3216    172800
.init                 24    176016
.plt                2160    176040
.text            1011448    178208
.fini                 14   1189656
.rodata           132592   1189696
.eh_frame_hdr      19508   1322288
.eh_frame          83924   1341800
.ctors                16   3526016
.dtors                16   3526032
.jcr                   8   3526048
.data.rel.ro       28056   3526080
.dynamic             448   3554136
.got                 720   3554584
.got.plt            1096   3555304
.data               5412   3556416
.bss                5368   3561856
.gnu_debuglink        28         0
Total            1466387

It counts all the sections one by one rather than grouping them. I’m still not sure how it decides whether or not to print a given section; I’d have to look at the actual binutils code to say. It’s not counting just the allocated sections (the ones that count towards the vmsize once the file is loaded to spawn a process), or otherwise .gnu_debuglink wouldn’t be in the list; but it’s also not showing all of the sections, since .shstrtab is not there at all. Interestingly enough, if you use the eu-size program that comes with elfutils rather than the version coming from GNU binutils, you’ll see that .gnu_debuglink is not in the list, which suggests a bug in binutils.

But it’s not just this inconsistency in naming that bothers me; it’s more that the output information is almost entirely pointless for making actual guesses about the behaviour of a program, although that’s the most common usage of size. The problem is that you cannot distinguish between increases in the size of the code and increases in the size of the constant tables it uses. It also does not easily let you know whether the amount of data is mitigated by prelink or not.

The SysV-style output is even more useless, since if you want to know the full data of the sections you can just use the readelf -S command, which reports among other things the size and the address each section is loaded at, which is just what the SysV style does. The only differences are that you cannot change the radix and that it does not give you a total, but even then I don’t see what the point is, after all. By the way, if you think that readelf -S is too difficult to read, try eu-readelf -S; it’s nicer.

So at the end of the day, what am I supposed to do? Well, at this point I’m writing my own script, rbelf-size, which shows a different set of headers, allowing me to actually distinguish between the different sections, so I can judge the changes in software and libraries across different versions and with different patches. I hope to have it available in Ruby-Elf’s repository by the end of next week; maybe it can be useful to others too.

I’m in yur ALSA… killing your dirties


Well, maybe not right now.

Since I doubt you’d be able to understand what I mean from the lolcat-speak title, let me try to summarise it in a language nearer to what people actually speak (yes, I know it’s still going to be too technical for some of the readers, but I guess that cannot really be helped).

Last night I couldn’t sleep, for a series of reasons, not least because, to make sure I could implement some stuff for my job while waiting for the actual definitive specs, I had three cups of coffee, which, while making me feel very nice, stops me from sleeping; not so nice when your neighbours have woken you up two days in a row fighting, but I can manage.

Since at the time I was waiting for the chroot to complete some builds so I could check and submit a few more bugs (the count of the “My Bugs” search on bugzilla now reaches 1200 bugs, and has some reserve too), I decided to try something different. I have already been adding, to my git repositories of a few libraries I contributed to in the past, enough build system changes that --no-undefined is used, so last night I decided to do some work on the upstream ALSA repositories.

I had already checked out three of the repositories when 1.0.18 was added to the tree, since I had to fix an --as-needed issue and decided to just go ahead and submit all the patches upstream for merging. This time I checked out alsa-lib, added --no-undefined, and then started some analysis with the ruby-elf tools cowstats and missingstatic, as well as removing a few compiler warnings, just to make sure I wouldn’t be distracted by faux problems.

The result should now be that the alsa-lib and alsa-plugins libraries have a few dirty pages less, and that the code is a bit more solid than before, with static and const modifiers added where needed. It wasn’t much work, but I forgot once again to add -s to the git commits, so I had to rewrite history to get the Signed-off-by header into all the commits; if somebody knows how to set up git per-repository so that it always uses -s when committing, I’d be very glad.

On the other hand, this task showed me that cowstats still had, and has, some problems; in particular, it lacked a way to separate .data.rel from .data.rel.ro section data. This is important to distinguish between the two, since .data.rel.ro is fully prelinkable, which means that after a prelink it would always be loaded from disk without further relocation, while .data.rel could still cause copy-on-write because it can be changed at runtime even after relocation.

This is further shown by noticing that shared objects built with GCC under Linux have .data and .data.rel.ro sections, but no .data.rel, which is instead merged back into .data itself. Because of this, the “real” data count in cowstats is entirely off. I’ll most likely have to rewrite that part.

Anyway, I’ve done my best, and hopefully tomorrow more of my patches will be merged, so that alsa-lib’s dirty pages get reduced again. Unfortunately, even after my changes, with all the plugins enabled, and in the worst-case scenario, libasound.so will go on requiring more than 28KiB of dirty pages per process (minus forks and various preloads). Which is not nice at all. Prelinking can remove these 28KiB of dirty pages (which are all of .data.rel.ro), and then it would just require a couple of pages.

There is one question though that is now driving me nuts: hasn’t Nokia worked on ALSA for their N800 tablets? I know alsa-plugins has a Maemo plugin (which I also cleaned up a bit last night, as it had quite a few hacks on the autotools side, and an unwarranted use of pow() instead of a left shift), but I’d expect Nokia to know better than to have so many dirty pages…

Anyway, for all the remaining users, I strongly suggest you look into removing some of the plugins that ship with ALSA, like the iec958 plugin if you don’t use digital pass-through. By cutting down the amount of built-in plugins you should be able to noticeably reduce the memory that alsa, and the applications using alsa, take up on your system.

I also wondered why I didn’t add a USE flag to disable the alisp feature… Sorry, of course I wouldn’t be able to find an alisp USE flag in the output of emerge -pv alsa-utils. D’oh! Why does ALSA need a LISP interpreter anyway?

Parameters handling and COWs.

One common problem while fixing copy-on-write pages is finding an array of structures that hold parameter names and their values, usually loaded either from the network or from configuration files. One classical declaration of those structures is:

struct {
    char name[16];
    int value;
} parameters[20] = {
  { "foo", 34 },
  { "bar", 50 }
};

I put a character array in the struct to show that using one does not make everything magically work.

What is wrong with this structure? Well, it is not constant, so it gets written to .data, which is a COW section. You can’t really make it constant either, as the value of the parameters has to change, if you’re using it to set the parameters for the program to work with. As long as the user does not change the default parameters it’s not going to use extra memory, luckily, but if the user changes even just one parameter, COW is triggered.

This is actually quite common, it’s even used by Xorg drivers like radeon, often with extra flags like type, and unions, and so on.

Most people would react to this by using pointers to strings, so that those get into .rodata and only their pointers are added to the parameters object. In the best case, those pointers will end up in .data; when using PIC, they’ll be added to .data.rel. The difference between the two is, again, that the former is only copied over when changing the values from the default, while the latter is copied over every time the executable or shared library is loaded (unless prelink is used, if that works at all, actually).
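
For reference, this is a sketch of that variant:

struct {
    const char *name; /* points into .rodata now… */
    int value;
} parameters[20] = {
  { "foo", 34 },
  { "bar", 50 }
}; /* …but the array itself is still writeable: .data, or .data.rel
      under PIC, copied over at every load */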

My proposed solution to that is simply to use two different objects. Somehow, one limitation of C shows up here: being unable to write in columns ;)

static const char param_names[20][16] = { "foo", "bar" };
int param_values[20] = { 34, 50 };

If there is more to the names, one might use a structure instead; think, for instance, of this example, where the parameter can be either an integer, a float or a string:

static const struct {
    char name[16];
    enum { PARAM_INT, PARAM_FLOAT, PARAM_STRING } type;
} parameters[20] = {
  { "foo", PARAM_INT },
  { "bar", PARAM_FLOAT },
  { "baz", PARAM_STRING }
};

union {
    int intval;
    float floatval;
    char *stringval;
} param_values[20] = {
  { .intval = 34 },
  { .floatval = 50.0 },
  { .stringval = "default" }
};

Now, in these last two examples, the metadata about the parameters (name and type) can be written off to .rodata, while the values are saved in .data; at the same time, save for the last one with the string, the parameter values will be shared as long as the defaults are not changed. Nice, uh?

The only thing to remember is to make sure the indexes of the two arrays stay in sync. This is why I think it’s a pity that there is no way to write in columns in C; it would make maintaining this kind of code quite a bit easier.
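
Short of columns, one trick that helps keeping the indexes in sync is naming them with an enum and using C99 designated initialisers; just a sketch, reusing the names from the examples above:

/* the enum names the indexes once, both arrays reuse them */
enum { PARAM_FOO, PARAM_BAR, PARAM_COUNT };

static const char param_names[PARAM_COUNT][16] = {
  [PARAM_FOO] = "foo",
  [PARAM_BAR] = "bar",
};

int param_values[PARAM_COUNT] = {
  [PARAM_FOO] = 34,
  [PARAM_BAR] = 50,
};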

When the list of parameters to maintain is quite big, the best way to tackle the problem is to write a script that translates a table of metadata and default values into C code. It should be quite easy to do with tools like Perl, Ruby or AWK. This way there is little to no overhead in maintaining the two arrays separately, and the memory use will be optimised.

As for the reason why I used a union in the last example: I have seen, more than once, structures defined the same way as that struct, carrying all three kinds at once. Those structures tend to waste a lot of space. Once you’re certain that only one of the three entries is used for a given parameter, using a union also limits the amount of memory the table uses. This is one of the things unions were designed for, after all.

Dirty RSS and compiled pci IDs

I was looking around for targets for improvements on COW memory, and I stumbled across one of Xorg’s private mappings:

flame@enterprise ~ % sudo smem `pidof X` | grep pcidata
     196 kb        0 kb      196 kb   /usr/lib64/xorg/modules/libpcidata.so
      68 kb        0 kb       68 kb   /usr/lib64/xorg/modules/libpcidata.so
     536 kb      344 kb        0 kb   /usr/lib64/xorg/modules/libpcidata.so

That’s quite some amount of memory being used here. The problem is that libpcidata contains a lot of strings that need to be relocated. What are those strings? Well, simply a C-compiled version of the pci.ids file.

Of course it’s faster to access pci.ids data at runtime after compiling it into C code at build time, but is it a good thing to do?

I admit I don’t know Xorg that well, but as far as I can see, the PCI scan happens when you actually start Xorg, as it’s used to identify the video card. So using the compiled version rather than the system pci.ids saves time during Xorg’s bootup. On the other hand, it uses a lot of memory during the rest of the runtime, for no apparent good reason (to me at least).

I know the code is mostly historical, and I also know that requiring people using Xorg to install pciutils might be a bit far-fetched, but really (and I’ve read this on Planet Freedesktop too, if I recall correctly) we keep reinventing the wheel.

My take on all this is that we should get pci.ids, usb.ids and whatever other data we can into one single big package, released by one single big project, and then let all the packages rely on that single copy of it.

If any other distributor is reading this, let me know what you think; maybe we could get this done by all asking the authors together.

Random notes about xine plugins interface

Here comes a new technical blog, and this time it’s also related to xine, so you get a doubly important post.

xine, like many other modern software projects, uses a plug-in architecture. This is pretty nice, especially for a multimedia application, as it allows even binary distributions to provide granularity like Gentoo does with USE flags, and allows the process not to load all the backend libraries it doesn’t need at once.

Or does it?

Unfortunately the architecture of xine has a few problems. One of these is that, at the moment, almost every plugin is loaded at application startup, rather than as needed. The first time this is understandable, as xine needs to produce a cache file, but the other times?

The first problem was with settings: a lot of plugins had settings that needed to be loaded at startup, but luckily Thibaut took care of this and implemented a cache for settings too. That solves one problem.

Then there is another problem: most of the time we can’t trust the extension of a file, so the default is to detect the demuxer by content. Detecting the demuxer by content requires loading all the demuxers and trying them one by one. This is expensive. The reverse detection method is good for files: we first try to identify the demuxer by the extension, and if that doesn’t work, and just then, we load all the demuxers and try them one by one. For streams, we can’t trust the extension at all.

So what are the problems here? Well, first off, even if we decided to use the reverse method as the default, we’d be losing all the advantage at the first file that fails extension detection, or at the first stream; then there is the problem that we end up testing even demuxers that, well, just can’t play files. This is the case for the cdda demuxer, for instance, or some network demuxers.

What we need then is to solve two problems:

  • detect streams without using by-content detection; this can be done once I have time to finish reading MIME types from the input (HTTP Content-type, but also the Content-type saved in extended attributes for local files); this should solve the problem of running content detection for every stream, as most stream sources will report a correct Content-type, and so most likely would gnome-vfs and other VFS implementations (see the sketch after this list);
  • get a way to tell xine-lib that a given demuxer works only with a certain input plugin (and vice versa, as the dvd input plugin will always use the same mpeg demuxer).
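
As a purely hypothetical sketch of the first point (the Content-type table and the demuxer names are made up, this is not xine’s actual code):

#include <stddef.h>
#include <string.h>

/* made-up table: map the reported Content-type to a single demuxer */
static const struct {
  const char *mime;
  const char *demuxer;
} mime_map[] = {
  { "audio/mpeg", "mpgaudio" },
  { "video/mp4",  "qt" },
};

const char *demuxer_for_mime(const char *mime) {
  for (size_t i = 0; i < sizeof(mime_map) / sizeof(mime_map[0]); i++)
    if (strcmp(mime_map[i].mime, mime) == 0)
      return mime_map[i].demuxer;
  return NULL; /* unknown type: fall back to by-content detection */
}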

This solves a few problems by itself, by avoiding the loading of unneeded plugins, but the architecture of xine’s plugin system is still rough.

That happens because of the way the data is currently exported by the plugin itself: through a global exported structure, containing a literal string and a pointer to a static function used to allocate and fill the next structure. This structure contains pointers, the pointers need to be relocated, so it lands in .data.rel.ro, which gets COW’d once the plugin is loaded and relocated; and as prelink does not work with plugins, we can’t even rely on that to solve the problem for us.

With the COW, every xine plugin costs 4KB of memory, per plugin, per process.

What I would like instead is a new xine_plugin_get_info() that allocates the structure on the heap and fills it in. It will most likely be slower, but it will a) remove the difference between structures loaded from the cache (which are allocated on the heap) and those loaded from plugins (which are mapped in .data.rel.ro), and b) avoid the 4KB waste per plugin per process, by allocating just the amount of memory needed to keep the structure in memory.
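
A rough sketch of the idea, heavily simplified and with hypothetical fields (the real plugin info structure is obviously bigger):

#include <stdlib.h>
#include <string.h>

/* hypothetical, heavily simplified plugin info structure */
typedef struct {
  char *id;
  void *(*init)(void);
} plugin_info_t;

static void *example_init(void) {
  return NULL; /* placeholder for the plugin's real init code */
}

plugin_info_t *xine_plugin_get_info(void) {
  /* heap allocation: no .data.rel.ro entry to COW, and the result looks
     the same as an entry loaded from the plugin cache */
  plugin_info_t *info = calloc(1, sizeof(*info));
  if (info == NULL)
    return NULL;

  info->id = strdup("demux_example");
  info->init = example_init;
  return info;
}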

If I have extra time I might actually try this on a branch, it might be a decent improvement after all.