What does #shellshock mean for Gentoo?

Gentoo Penguins with chicks at Jougla Point, Antarctica
Photo credit: Liam Quinn

This is going to be interesting as Planet Gentoo is currently unavailable as I write this. I’ll try to send this out further so that people know about it.

By now we have all been doing our best to update our laptops and servers to the new bash version so that we are safe from the big scare of the quarter, shellshock. I say laptop because the way the vulnerability can be exploited limits the impact considerably if you have a desktop or otherwise connect only to trusted networks.

What remains to be done is to figure out how to avoid this repeats. And that’s a difficult topic, because a 25 years old bug is not easy to avoid, especially because there are probably plenty of siblings of it around, that we have not found yet, just like this last week. But there are things that we can do as a whole environment to reduce the chances of problems like this to either happen or at least avoid that they escalate so quickly.

In this post I want to look into some things that Gentoo and its developers can do to make things better.

The first obvious thing is to figure out why /bin/sh for Gentoo is not dash or any other very limited shell such as BusyBox. The main answer lies in the init scripts that still use bashisms; this is not news, as I’ve pushed for that four years ago, while Roy insisted on it even before that. Interestingly enough, though, this excuse is getting less and less relevant thanks to systemd. It is indeed, among all the reasons, one I find very much good in Lennart’s design: we want declarative init systems, not imperative ones. Unfortunately, even systemd is not as declarative as it was originally supposed to be, so the init script problem is half unsolved — on the other hand, it does make things much easier, as you have to start afresh anyway.

If either all your init scripts are non-bash-requiring or you’re using systemd (like me on the laptops), then it’s mostly safe to switch to use dash as the provider for /bin/sh:

# emerge eselect-sh
# eselect sh set dash

That will change your /bin/sh and make it much less likely that you’d be vulnerable to this particular problem. Unfortunately as I said it’s mostly safe. I even found that some of the init scripts I wrote, that I checked with checkbashisms did not work as intended with dash, fixes are on their way. I also found that the lsb_release command, while not requiring bash itself, uses non-POSIX features, resulting in garbage on the output — this breaks facter-2 but not facter-1, I found out when it broke my Puppet setup.

Interestingly it would be simpler for me to use zsh, as then both the init script and lsb_release would have worked. Unfortunately when I tried doing that, Emacs tramp-mode froze when trying to open files, both with sshx and sudo modes. The same was true for using BusyBox, so I decided to just install dash everywhere and use that.

Unfortunately it does not mean you’ll be perfectly safe or that you can remove bash from your system. Especially in Gentoo, we have too many dependencies on it, the first being Portage of course, but eselect also qualifies. Of the two I’m actually more concerned about eselect: I have been saying this from the start, but designing such a major piece of software – that does not change that often – in bash sounds like insanity. I still think that is the case.

I think this is the main problem: in Gentoo especially, bash has always been considered a programming language. That’s bad. Not only because it only has one reference implementation, but it also seem to convince other people, new to coding, that it’s a good engineering practice. It is not. If you need to build something like eselect, you do it in Python, or Perl, or C, but not bash!

Gentoo is currently stagnating, and that’s hard to deny. I’ve stopped being active since I finally accepted stable employment – I’m almost thirty, it was time to stop playing around, I needed to make a living, even if I don’t really make a life – and QA has obviously taken a step back (I still have a non-working dev-python/imaging on my laptop). So trying to push for getting rid of bash in Gentoo altogether is not a good deal. On the other hand, even though it’s going to be probably too late to be relevant, I’ll push for having a Summer of Code next year to convert eselect to Python or something along those lines.

Myself, I decided that the current bashisms in the init scripts I rely upon on my servers are simple enough that dash will work, so I pushed that through puppet to all my servers. It should be enough, for the moment. I expect more scrutiny to be spent on dash, zsh, ksh and the other shells in the next few months as people migrate around, or decide that a 25 years old bug is enough to think twice about all of them, o I’ll keep my options open.

This is actually why I like software biodiversity: it allows to have options to select different options when one components fail, and that is what worries me the most with systemd right now. I also hope that showing how bad bash has been all this time with its closed development will make it possible to have a better syntax-compatible shell with a proper parser, even better with a proper librarised implementation. But that’s probably hoping too much.

Tinderbox problems

My tinderbox for those who don’t know it, is a semi-automatic continuous integration testing tool. What it does is testing, repeatedly and continuously, the ebuilds in tree to identify failures and breakages in them as soon as they are available. The original scope of the testing was ensuring that the tree could be ready for using --as-needed by default – it became a default a few months ago – but since then it has expanded to a number of other testing conditions.

Last time I wrote about it, I was going to start reporting bugs related to detected overflows and so I did; afterwards there has been two special requests for testing, from the Qt team for the 4.7 version, and from the GNOME team for the GTK 2.22 release — explicitly to avoid my rants as it happens. This latter actually became a full-sweep tinderbox run, since there are just too many packages using GTK in tree.

Now, I already said before that running the tinderbox takes a lot of time, and thankfully, Zac and Kevin are making my task there much easier; on the other hand, right now I start to have trouble running it mostly for a pure economical matter. From one side, running it 247 for the past years starts to take its tall in power bills, from another, I start to need Yamato’s power to complete some job tasks, and in that time, the tinderbox is slowing down or pausing altogether. This, in turn, causes the tinderbox to go quite off-sync with the rest of Gentoo and requires even more time from there on.

But there are a few more problems that are not just money-related; one is that that the reason why I stopped posting regular updates about the tinderbox itself and its development is that I’ve been asked by one developer not to write about it, as he didn’t feel right I was asking for help with it publicly — but as it turns out, I really cannot keep doing it this way so I have to choose between asking for help or stop running the tinderbox at all. It also doesn’t help that very few developers seem to find the tinderbox useful for what it does: only a few people actually ask me to execute it for their packages when a new version is out, and as I said the GNOME team only seemed compelled to ask me about it for the sole reason of avoiding another rant from me rather than actually to identify and solve the related problems.

Not all users seem to be happy with the results, either. Since running the tinderbox also means finding packages that are broken, or that appear so, which usually end up being last-rited and scheduled for removal, it can upset users of those packages, whether they really need them or they just pretend and are tied to them just for legacy or historical reasons. Luckily, I found how to turn this to my advantage, which is more or less what I did already with webmin sometime ago and that is to suggest users to eitehr pick up the package themselves or, in alternative, to hire someone to fix it up and get it up to speed with QA practices. Up to now I hadn’t been hired to do that, but at least both webmin and nmh seem to have gotten a proxy-maintainer to deal with them.

Talking about problems, the current workflow with the tinderbox is a very bad compromise; since auto-filing bugs is a bit troublesome, I haven’t gone around filing them just yet; Kevin provided me with some scripts I’d have to try, but they have one issue, which is the actual reason why I haven’t gone filing them just yet: I skim through the tinderbox build logs through a grep process running within an emacs session; I check log by log for bugs I already reported (that I can remember); I check Bugzilla through Firefox in the background (it’s one of the few uses of Firefox I still have; otherwise I mostly migrated to Chromium months ago) to see if someone else reported the same bug, then if all fails, I manually file it through the use of templates.

This works out relatively well for a long series of “coincidences”: the logs are available to be read as my user, and Emacs can show me a browsable report of a grep; Bugzilla allows you to have a direct search query in Firefox search, and most of the time, the full build log is enough to report the bug. Unfortunately it has a number of shortcomings, for instance for emerge --info I have to manually copy and paste it from a screen terminal (also running on the same desktop in the background).

To actually add a self-reporting script to the workflow, what I’d be looking for is a way to launch it from within Emacs itself, picking out the package name and the maintainers from the log file itself (portage reports maintainer information at the top of the log nowadays, one thing that Zac implemented and made me very happy about). Another thing that would help would be bugzilla integration with Emacs to find the bugs; this may actually be something a Gentoo user who’s well versed in Emacs could help a lot about; adding a command “search on Gentoo Bugzilla” to Emacs so that it identifies the package the build log refers to, or the ebuild refers to, and report within itself the list of known open bugs for that package. I’m sure other developers using Emacs would find that very useful.

I also tried using another feature that Zac implemented, to my delight: compressed logs; having gzip-compressed logs makes the whole process faster for the tinderbox (logs are definitely smaller on disk, and thus require less I/O), and makes it easier to store older data, but unfortunately Bugzilla does not hand out transparently those logs to browsers; worse, it seems to double-compress them with Firefox even though they are properly provided a mime-type declaration, resulting in difficult-to-grok logs (I still have to compress some logs because they are just too big to be uploaded to Bugzilla). This is very unfortunate because scarabeus asked me before if I could get the full logs available somewhere; serving them not compressed is not going to be fun for any hosting service. Maybe Amazon S3 could turn out useful here.

Actually, there is also one feature that the tinderbox lost, as of lately: the flameeyestinderbox user on identi.ca, which logged, as a bot, all the running of the tinderbox, was banned again. The first time support unbanned it the same day, this time I didn’t even ask. An average of around 700 messages a day, and a single follower (me) probably don’t make it very palatable to the identi.ca admins. Jürgen suggested me to try requesting my own private StatusNet, but I’m not really sure I want to ask them to store even more of my data, it’s unlikely to be of any use for them. Maybe if I’ll ever end up having to coordinate more than one tinderbox instance, I’ll set up a StatusNet installation proper and let that one aggregate all the tinderboxes reporting their issues. Until a new solution has been found, I then fully disabled bti in the tinderbox code.

Anyway, if you wish to help, feel free to leave comments with ideas, or discuss them on the gentoo-qa mailing list (Kevin, you too please, your mail ended up in my TODO queue, and lately I just don’t have time to go through that queue often enough, sigh!); help with log analysis, bug opening and CCing is all going to make running the tinderbox much smoother. And that’s going to be needed if I won’t be able to run it 247. As for running the tinderbox for longer, if you run a company and you find the tinderbox worth it, you can decide to hire me to keep it running or donate/provide access to boxes powerful enough to run another instance of it. You can even set up your own (but it might get tricky to handle, especially bug reporting since if you’re not a Gentoo developer nor a power user with editbugs capabilities you cannot directly assign the filed bugs).

All help is appreciated, once again. Please don’t leave me here alone though…

Adding to the tree for once

You probably are used to me sending last rites for packages on behalf of QA, and thus removing packages (or in the past just removing packages, without the QA part). It often seems like my final contribution to Gentoo will be negative, in the sense that I remove more than I add.

Right now, though, I’ve been adding quite a few packages to the tree, so I’d like to say that no, I’m not one of those people who just like to remove stuff and who would like you to have a minimal system!

So with some self-advertisement (and a shameless plug while I’m at it…) I’d like to point out some of the things I’ve been working on.

The first package I’d like to separate is the newly-added app-emacs/nxml-libvirt-schemas: as I ranted on I wanted to be able to have syntax completion for the libvirt (a)XML configuration files (I still maintain they should have had either a namespace, a doctype or a libvirt-explicit document element name), so now that me and Daniel got the schemas to a point where they can be used to validate the current configuration files, I’ve added the package. It uses the source tarball of libvirt, with the intent of not depending on it, I’m wondering if it’d be better to use the system-installed Relax-NG files to create the specific Relax-NG Compact files, but that’ll have to wait for 0.7.5 anyway (which means, hopefully next week).

The second set of packages is obviously tied in with the Ruby-NG work and consists of a few new Ruby packages; some are brought in from the testing overlay I’ve built to try out the new packages, others have been brought in as dependency of packages being ported to the new eclasses, one instead (addressable) is a dependency of Typo that I didn’t install through Portage lately. I should probably add that I’m testing the new ebuilds “live” on this blog, so if you find problems with it, I’d be happy to receive a line about that.

The third set of packages instead relates to a work I’m currently doing for a long-time customer of mine, a company developing embedded systems; I won’t disclose much about the project, but I’m currently helping them build a new firmware, and I’m doing most of my job through the use of Portage and Gentoo’s own ebuilds. For this reason I have already added an ebuild for gdbserver (the small program that allows for remote debugging) that makes it trivial to cross-compile it for a different architecture, and I’m currently working on a gcc-runtime ebuild (which would also be pretty useful, if I get it right, for remote servers, like my own, to void having to install the full-blown GCC, but still have the libraries needed).

And tied to that same work you’ll probably find a few changes for cross-compilation both in and out of Gentoo, and some other similar changes; I have some GCC patches that I have to send upstream, and some changes for the toolchain eclass (right now you cannot really merge a GCJ cross-compiler, or even build one for non-EABI ARM).

So this is what I’m currently adding to the tree myself, I’m also trying to help the newly-cleaned up virtualization team to handle libvirt (and its backports) as well as the GUI programs, and I should be helping Pavel getting gearmand into shape if I had more time (I know, I know). And this is to the obvious side of the tinderbox work which is still going on and on (and identi.ca proves it since the script is denting away the status continuously), or the maintainer work for things like PAM (which I bumped recently and I need to double-check for uclibc).

So now, can you see why I might forget about things from time to time?

More explanations: why nano is Gentoo’s “default editor”

I say “default editor” in quotes because Gentoo really does not have a default editor, but rather a fallback-editor for whenever the users hadn’t chosen one themselves. And this is why stuff like the stage3, sudo, openrc and so on default to that when $EDITOR is unreliable (that is, until we can move to a full-fledged wrapper script).

Paul, in the previous post explained it quite well, albeit briefly, but since some other people asked me about that again, I’ll quote him and then go in a bit more of details:

… Third nano was kept as the default because it is small, straightforward, and easy to get out of when you don’t know it (let’s face it neither vi(m) nor (x)emacs is that easy to get out of). Fourth, by having the EDITOR/VISUAL variables point to nano (which is installed by stage2 as well as stage3) things would be working (although probably not optimal) out of the box. …

Indeed Paul speaks well: neither vi (or vim) nor emacs (or xemacs) is going to work. Why is that? Some people even goes as far to say that “it’s called *vi*sudo“ and “vi is the default Unix editor”. At the same time, I wouldn’t be surprised to hear “Emacs is the default GNU editor”. Well, let’s decompose it in two different problems.

Technical problem: the system set is messy already (and this is also “thanks” to my own pambase creation — which is, though, a solution to a previous problem). The packages that do enter the system set should be, whenever possible, small and with little dependencies. This includes, obviously, USE-optional dependencies, since when they are enabled, the dependencies also become part of the system set (like it happens with gnome and pambase). The dependency tree of nano is near invisible, while those of both vim and emacs are quite long (sure there are things like elvis, nvi, zile and jove, but this will bring us to the next point…).

Reality problem (or flame-retardant method): there is no way on Earth that any average number of people will be accepting to use the same editor as default, when taken from the list of “most popular” editors. Hey we cannot even find consensus among the three developers of lscube (I’m using Emacs – yeah I’m an Emacs user, burn me – and my colleagues using VIM and Eclipse). If we were to choose vim, all Emacsen users will complain, if we were to use GNU Emacs, both vim and XEmacs users will complain, and so on so forth. So we use nano, which is likely to make unhappy the most people, but all in the equal amount.

There’s a corollary to the reality problem above: if we were to choose any of the lightweight variants that I named above, or more of them, most likely the problem would be more or less the same, since almost all people would be pretty unhappy with the choice and would then unmerge that and install their editor of choice. But it would most likely upset the two “main factions” unequally, which would make one feel discriminated against.

Newbie problem: finally, there is another note: both Emacs and VI (and respective clones) aren’t exactly the most user-friendly editors. A newbie user who has no clue how to work in Gentoo is unlikely to guess at first sight how to use either of them, while nano is pretty much the easiest thing you might find around. You can fight as much as you want about powerfulness (Emacs and VI are obviously much more powerful than nano) and you can fight about relative easiness of use (:w versus C-x C-w) but nano is going to win over both of them in that regard.

So no, we’re not going to change the default anytime soon.

Per-buffer trailing whitespace killing

As you might know if you read my blog since a long time ago, I’m an Emacs user, or should I say GNU Emacs? Oh well that’s not important right now.

While some of the features, like the emacs-daemon support sometimes drive me crazy because they don’t work properly, there are quite a few that are very very useful; for instance, I can use su to access files as root, both locally and on the server, still using the same editor windows (sorry, frames) as the local files.

From time to time, though, I end up having to do some stuff that Emacs does not, by default, allow me to do. Yesterday, the problem that came up was deleting the trailing whitespace from all the lscube sources when opening the files (and also when saving). There is a function that allows to delete all trailing whitespace, delete-trailing-whitespace with very little fantasy, and it’s easy enough to add it to the find-file-hooks or write-file-hooks lists so that the whitespace is removed after opening or before saving. But unfortunately doing it for all files means applying it to lots of places where I don’t want it to happen.

I’ve searched around but I couldn’t find a way to do this conditionally; some major and minor modes, like the ebuild mode that gives emacs ways to deal with ebuild files, and a quick interface to ebuild commands (unpack, compile, and so on so forth), calls the function during the write phase so that the ebuilds don’t have trailing whitespace (repoman would check for that, so it’s useful to kill them as soon as they are found out). Unfortunately, doing so on a per-project basis seemed to me non-trivial.

First of all the .dir-locals.el file, that allows to provide special buffer-local variables for all the files under a given subtree, does not seem to allow executing special hooks code, it only allows to set variables. This is probably a good idea from a security point of view since this is automatically loaded on each file opening. But adding the function to the general hooks there, also does not work. So I decided simply to write my own hook, that can be configured through a buffer-local variable:

(setq do-delete-trailing-whitespace nil)
(make-variable-buffer-local 'do-delete-trailing-whitespace)

(defun maybe-delete-trailing-whitespace () ""
  (and do-delete-trailing-whitespace
       (progn
     (delete-trailing-whitespace)
     )))

(add-hook 'find-file-hooks 'maybe-delete-trailing-whitespace)
(add-hook 'write-file-hooks 'maybe-delete-trailing-whitespace)

Then I can just set the do-delete-traling-whitespace variable to t in the .dir-locals.el and the trick is served. Lovely.

And this is my first piece of useful elisp code!

User Services

A little context for those reading me; I’m writing this post a Friday night when I was planning to meet with friends (why I didn’t is a long story and not important now). After I accepted that I wouldn’t have a friendly night I decided to finish a job-related task, but unfortunately I’ve had some issues with my system. Somehow the latest radeon driver is unstable (well it’s an experimental driver after all), and it messes up compiz; in turn after a while either X crashes or I’m forced to restart it. This wouldn’t be a problem if the emacs daemon worked as expected. Since it doesn’t, I lose my workspace, with the open files, and everything related to that. It’s obnoxious. Since this happened four times already today I decided to take the night off, but I wasn’t in the mood for playing, so I settled for watching Pirates of the Carribean 2 in Blu-Ray, and write out some notes regarding the topics I wanted to write about for quite a while.

The choice of topic was related to the actual context I’ve just written above. As I said GNU Emacs is acting badly when it comes to the daemon. While the idea of the daemon would be to share buffers (open files) between ttys, network and graphical session, and to actually allow restarting those sessions without losing your settings, your data, and your open files, it’s pretty badly implemented.

A few months ago I reported that as soon as X was killed by anything (or even closed properly), the whole emacs daemon went down. After some debugging it turned out to be a problem with the handling of message logging. When the clients closed they sent a message to be logged by the emacs daemon, but since it had no way to actually write it to a TTY session, it died. That problem have been solved.

Now the problem appear to be just the same mirrored around: after X dies, the emacs daemon process is still running, but as soon as I open a new client, it dies. I guess it’s still trying to logging. As of today the problem still happens with the CVS version.

So anyway, this reminded me of a problem I already wanted to discuss with a blog: user-tied services. Classically, you had user-level software that i s started by an user and services that are started by the init system when the system starts up. With time, software became less straightforward. We have hotplugged services, that start up when you connect hardware like, for instance, a bluetooth dongle, and we have session software that is started when you login and is stopped once you exit.

Now, most of the session-related software is started when you log into X, and is stopped when you exit, sometimes, though, you want processes to persist between sessions. This is the case of emacs, but also my use case for PulseAudio since I want for it to keep going from before I login to before I shut down the system straight. There are more cases of similar issues but let’s start with this for now.

So how do we handle these issues? Well for PulseAudio we have an init script for the systemwide daemon. It works, but it’s not the suggested method to handle PulseAudio (on the other hand is probably the only way to have a multi-user setup with more than one user able to play sound, but that’s for another day too). For emacs, we have a a multiplexed init script that provides one service per user; a similar method is available for other service. Indeed, in my list of things to work on regarding PulseAudio there is to add a similar multiplexed init script to run per-user sessions of PulseAudio without using the system wide instance (should solve a bit of problems).

So the issue should be solved with he multiplexed per-user init script, no? Unfortunately, no. To be able to add the init scripts to the runlevels to be started, you need to have the root privileges. To start, stop and restart the services, you also need root privileges. While you can use sudo to allow users to run the start/stop commands to the init script, this is far from being the proper solution.

What I’d like to have one day is a way to have user-linked services, in three type of runlevels: always running (start when the machine starts up, stop when the system shuts down), after the first login (the services are started at the first login and never stop till shutdown), and while logged in (the services start at the first login, and stop when the last session logs out).

At that point it would be then possible to provide init scripts capable of per-user multiplexing for stuff like mpd too, so that users could actually have the flexibility f choosing how to run any software in the tree.

Unfortunately I don’t have any idea on how to implement this right now, but I guess I could just throw this in for the Summer of Code ideas.

The mailing lists problem

For a while I’ve been using GMane for almost all the mailing list I’m following. This made it much easier to deal with it because it mean that I didn’t have to download a local copy of all the messages and it also allowed me for a much cleaner access to archives. Unfortunately this has a downside, as it requires me to use an NNTP client, which is something that hasn’t been much cool to do lately.

In the last few few months I’ve used gnus as my client, but the problem with that is that it still needs a separate software from Evolution (which is my mail client0, and it had the nasty issue of not saving the read messages when emacs closed unexpectedly, say when X died, and because of a bug, emacs daemon died with it.

Now of course I could use mailing lists like most of the other people in the world do, by receiving them on my mail account, but I’d rather not fill my GMail “All Mail” folder with all the messages coming from mailing list, especially for when I have to access that data from an UMTS connection where I pay the traffic.

What other solutions are there for me? I’m considering the idea of doing what I did when there was a limit of 20MB on an email account from my provider, but no limit on the number of accounts, having one account per mailing list. Now the limit is 1GB but I haven’t been using those accounts in quite a long time. Even if they are available in IMAP when I’m on their connection they are only available through webmail from outside; not like that’s too much of an issue for me, since I need mailing list mostly when I’m not around hospitals or stuff like that, so I could just use that.

For this to work, though, I need a few tools working on IMAP, for instance I’d need a script that could expunge old archives, when the threads haven’t been updated in the last few months; I’ll also need a script that would automatically filter the mailing list as they arrive in Inbox (waiting with the IDLE command), since my provider does not allow me to filter the messages server-side. And such a script will have to run on Yamato since it has to be on their network.

Then there is the problem that Evolution saves its cache in my home directory, which is under RAID6; there is no need for the cache to be on my home, when XDG_CACHE_DIR is pointing to a private subdirectory in /var/cache. This includes the 200MB of SQLite database that is currently using. Does anybody know if there is a way to get Evolution to respect the XDG_CACHE_DIR variable?

DocBook 5 and Gentoo

After my post about DocBook 5 which I wrote from the hospital, with no connection to my Yamato to test most of the basis of it, today I started looking at working with DocBook 5 here, too.

This is mostly so I can resume working on the article I had to put on hold for surgery, but it’ll also work fine out as a way to start looking into what I was thinking of.

The first problem was to get good new Emacs to work with DocBook 5 files. Even the latest CVS version will still only support the old 4.x series “schemas”, and thus won’t support the new namespace aware DocBook 5. Luckily, I use Aquamacs on the laptop too so I already looked up how to support it. To make it more streamlined, for everybody me included, I decided to make it a step further.

If you’re an Emacs user in Gentoo and want to work with DocBook 5, just install app-emacs/nxml-docbook5-schemas and it will install the Relax-NG compact schema for DocBook 5 with support for XInclude (hurray for XInclude), and set up nxml so that it’s loaded properly, just restart Emacs (or find a way to make it reload nxml cache) and you’re set. Please note that nxml-mode decides which schema to use once you open the file in the buffer, so if you’re converting an old DocBook file to the new version, you’ll have to close the buffer and reopen the file.

Unfortunately xmlto does not seem to support DocBook 5 just yet, and it insists on looking for a DTD to validate the content. Even the latest version, which I bumped in the tree, fails to generate content out of DocBook 5. So I decided that xsltproc is good enough for me, and decided to make it possible to install the stylesheets as needed.

I’ve bumped app-text/docbook-xsl-stylesheets to the latest version and then created a new package, app-text/docbook-xsl-ns-stylesheets that use the new namespace aware stylesheets. Unfortunately for this to work, I also had to update the build-docbook-catalog script, that agriffis used to maintain… last touched in 2004. I suppose this is one of the things that should be quite more documented as to “Why? How? Who?” — there are instructions on how to release it but no documentation on why this is needed, how it was implemented and so on.

The script is very much abandoned and had quite a few hacks in it, one of which I removed because it wasn’t making any sense any longer (it was redirecting some old versions of stylesheets to the local copy, but not the most recent ones — such old versions don’t exist for xsl-ns for instance so it makes no sense to duplicate the redirections). Part of the reason why it was executed at every install/removal is not even relevant nowadays (stylesheets used to be installed with their version number, like it is done for DTDs, but this is no longer the case as they are not slotted), it should well be possible to just reduce the code needed to a simple function in the ebuild itself, rather than using such a bit-rotting script.

At any rate, I future proofed it a notch by making it possible to support xsl-xalan and xsl-saxon just by creating the proper ebuilds for the tree (I’m not interested in those yet so I didn’t create them).

I suppose one interesting thing would be to install Gentoo’s DTDs and Gentoo’s stylesheets through ebuilds too, but it’s unlikely to be easy or much useful as it is, so let’s not get there yet.

So anyway, I hope what I did today is going to be helpful to others to use DocBook 5 with Gentoo; if it was, appreciation tokens are quite appreciated, especially at convalescence time ;)

Parallel emerge versus parallel make

Since I now have a true SMP system, it’s obvious that I’m expected to run parallel make to make use of this. Indeed, I set in my make.conf to use -j8 where I was using -j1 before. This has a few problems in general and it’s going to take some more work to be properly supported.

But before I start to get to those problems, I’d like to provide a public, extended answer to a friend of mine who asked me earlier today why I’m not using Portage 2.2’s parallel emerge feature.

Well, first of all, parallel emerge is helpful on SMP system during a first install, a world rebuild (which is actually what I’m doing now) or in a long update after some time spent offline; it is of little help when doing daily upgrades, or when installing a new package.

The reason is that you can effectively only merge in parallel packages that are independent of each other. And this is not so easy to ensure, to avoid breaking stuff, I’m sure portage is taking the safe route and rather serialise instead of risking brokenness. But even this, expects the dependency tree to be complete. You won’t find it complete because packages building with GCC are not going to depend on it. The system package set is not going to be put in the DEPEND variables of each ebuilds, as it is, and this opens the proverbial vase to a huge amount of problems, in my view. (Now you can also look up an earlier proposal of mine, and see if it had sense then already).

When doing a world rebuild, or a long-due update, you’re most likely going to find long chains that can be put in parallel, which I sincerely find risky, but they don’t have to be. When installing a new package, on a system that is already well installed and worked on for a few weeks even, you’ll be lucky (or unlucky) to find two or three chains at all. If you’re doing daily updates, finding parallel chains is also unlikely, as the big updates (gnome, kde, …) are usually interdependent.

Although it’s a nice feature to have, I don’t find it’s going to help a lot on the long run, I think parallel make is the thing that is going to make a difference in the medium term.

Okay, so what are the problems with using -j8 for ebuilds then?

  • we express the preference in term of (GNU) make parameters, but not all packages in Portage are built with make, let alone GNU’s;
  • ebuilds that use a non-make-compatible build system will try to parse the MAKEOPTS variable to find out the number of parallel jobs to execute; this does not always work right because there can be other options, like -s (which I use) that might make parsing difficult;
  • even -s option can be useful to some non-make-compatible build systems, but having to translate every option is tremendously pointless and boring;
  • some people use a high number of jobs because they have multiple box building as a cluster, using distcc or icecream; these won’t help with linking though, or with non-compile jobs; forcing non-compile tasks to a single job is going to discontent people using SMP systems, using a job count based on network hosts for non-compile tasks is going to slow down people with single-cpu and multi-host setups;
  • some tasks are being serialised by ebuilds when the could be ran in parallel;

And this is not yet taking into consideration buildsystem-specific problems!

What should be doing then? Well, I think the first point to solve is the way we express the preferences. Instead of expressing it in term of raw parameters to make, we should express it in term of number of jobs, and of features. For instance, a future version of portage might have a make.conf like this:

BUILD_LOCAL_JOBS="8"
BUILD_NETWORK_JOBS="12"

BUILD_OPTIONS="ccache icecream silent"

And then an ebuild would call, rather than simply emake, two new scripts: elmake and enmake (local and network), which would expand to the right number of jobs, for make-compatible buildsystems, that is. For other build systems eclasses could deal with that by getting the number of jobs and the features from there.

More options might be translated this way without having to parse the make syntax in each ebuild, or in each eclass. The ebuilds could also declare a global or per-phase limit to jobs, or a RESTRICT="parallel-make", that would make Portage use a single job.

The last point is probably the most complex one. Robin already dealt with a similar issues in the .bdf to .pcf translation of fonts, and solved it by having a new package provide a Makefile with the translation rules, the conversion could then be parallelised by make, instead of being serialised by the ebuild. I think we should do something like this in quite a few cases; the first one I can think of is the elisp compile for emacs extensions, and I don’t know whether Python serialises or execute in parallel the bytecode generation when providing multiple files to py_compile. And this is just looking at two eclasses I know doing something similar to this. But also Portage’s stripping and compressing of files should probably be parallelised, where there are enough resources to do so locally.

I guess I have found yet another task I’ll spend my time on, especially once I’m back from the hospital.

System headers and compiler warnings

I wish to thank Emanuele (exg) for discussing about this problem last night, I thought a bit about it, checked xine-lib in this regard, and then decided to write something.

Not everybody might know this, but GCC, albeit reporting tons of useful warnings, especially in newer versions, together with a few warnings that can get easily annoying and rarely useful, like pointer signs, ignores system headers when doing its magic.

What does this mean? Well, when a system library install an header that would trigger warnings, they are by default ignored. This is useful because while you’re working on the application foo you won’t care what glibc devs did. On the other hand, sometimes these warnings are useful.

What Emanuele found was missing by ignoring the warnings caused by the system headers was a redefinition of a termios value in Emacs for Darwin (OS X). I checked for similar issues in xine-lib and found a few that I’ll have to fix soonish.

I’m not sure how GCC handles the warnings suppression, I think it’s worth opening a bug for them to change its behaviour here, though, as the redefinition is a warning caused by the program’s code, not by the system headers.

Now of course I hear the KDE users to think “but I do get warnings from the system headers”, in reference to KDE’s headers. Well, yes:

In file included from /usr/kde/3.5/include/kaboutdata.h:24,
                 from main.cpp:17:
/usr/qt/3/include/qimage.h: In member function ‘bool QImageTextKeyLang::operator<(const QImageTextKeyLang&) const':
/usr/qt/3/include/qimage.h:58: warning: suggest parentheses around && within ||

this warning I took from yakuake build, but it’s the same for every KDE package you merge, more or less. It’s a warning caused by an include, a library include, but in general the same rules apply for those.

Why is not not suppressed? The problem is in how the inclusion of the path happens. Which is probably my main beef against system headers warning suppression: it’s inconsistent.

By default the includes in /usr/include (and thus found without adding any -I switch) get their warnings suppressed. If a library (say, libmad) installs its headers there, it will get its warnings suppressed.

On the other hand if a library installs its headers in an alternative path, like /usr/qt/3/include in the example above, or a more common /usr/include/foobar, then it depends on how that directory is added to the path of searched directories. If it’s added through -I (almost every case) its warnings will be kept; they would be suppressed only if you use -isystem. Almost no software uses that option that, as far as I know, it’s gcc specific.

So whether a library will have the warnings caused by its headers suppressed or not, depends on the path. Nice uh? I don’t think so.

More work!