Archiving

One of the requests at the last VDD, when I showed some VLC developers my Ruby-Elf toolsuite, was for it to access archive files directly. This had been on my wishlist for a while as well, so I decided to start working on it. To be precise, I started writing the parser (and actually wrote almost all of it!) on the Eurostar that was bringing me from Paris to London.

Now you have to understand that while the Wikipedia page is quite a good source of documentation for the format itself, it’s not exactly complete (it doesn’t really have to be). But it’s mostly to the point: there are currently two main variants of ar files, the GNU and the BSD one, and the only real difference between the two is in the way long filenames are handled. By long filenames I mean filenames that are at least 16 characters long.

The GNU format handles this with a single index file that provides the names for all the files, giving you an offset into it instead of the proper name in the header data, whereas the BSD format gives you a length, and prepends the filename to the actual file data. Which of the two options is the better one is well up for debate.
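To make the difference concrete, here is a sketch of the 60-byte member header, with the two long-filename conventions noted in comments (the field names are mine, but the layout is the standard one):

```c
/* Common ar member header: 60 bytes of fixed-width ASCII fields,
 * followed directly by the member data. */
struct ar_hdr {
	char name[16];  /* member name; GNU terminates it with '/',
	                   BSD pads it with spaces */
	char mtime[12]; /* decimal modification time */
	char uid[6];    /* decimal user ID */
	char gid[6];    /* decimal group ID */
	char mode[8];   /* octal file mode */
	char size[10];  /* decimal size of the member data */
	char fmag[2];   /* magic terminator: "`\n" */
};

/* GNU long names: the name field holds "/123", a decimal offset into
 * a special "//" member that stores all the long names together.
 *
 * BSD long names: the name field holds "#1/86", i.e. the length of the
 * real name, which is prepended to the member data (and counted in the
 * size field); this is the spot where Apple's ar pads an 86-character
 * name to 88 bytes with two NULs, as described below. */
```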

I already knew of the difference, so I coded in support for both variants, but of course while on the train I only had access to the GNU version of ar, the one in Binutils, so I only wrote a test for that. Now that I’m back at the office (temporarily in Los Angeles, as it seems like I’ll be moving to London soon enough), I have a Mac by my side, and I decided to prepare the files for testing with its ar(1), which is supposedly BSD.

I say supposedly because something strange happened! The long-filename code is hit by two of my testcases: one uses an actual object file which happens to have a name longer than 16 characters, the other an explicitly long (very long) filename: 86 characters! But what happens is that Apple’s version of ar writes the filename as 88 characters, padding it with two final null bytes. At first I thought I had gotten something wrong in the format, but if I use bsdtar on Linux, which provides, among other formats, support for the BSD ar format, it properly writes down 86 bytes without any kind of null termination.

More interestingly, the other archive, where the filename is just 20 characters long, is written the exact same way by both libarchive and Apple’s ar!

Should we be lesser GNUs?

I have sincere doubts when people insist on calling Gentoo a GNU/Linux distribution, mostly because we’re well versed in supporting non-GNU Free Software alternatives, even if sometimes it’s quite difficult to do so. One of the cases I tried to tackle before, and failed at, was the ability to use bsdtar as the system implementation of tar(1). Lacking a complete system like Debian’s alternatives, and left with mostly-custom, nearly identical eselect scripts, supporting tar alone is trivial; supporting multiple implementations and tools really isn’t.

But why am I bringing the idea up again right now? It’s not like the Google Summer of Code is approaching (well, there is Google Code-In, but this is definitely not the kind of thing that could be done there). Instead, the reason why I’m approaching the notion once again is that GNU has failed us again: version 1.24 of GNU tar is as bugged as the Fallout 3 vaults.

In a previous iteration I complained about GNU software not working properly with the new version of GCC (4.5). One of the pieces of software that failed to work as intended was, as you might have already guessed, tar. Indeed, the new, stronger buffer-overflow detection in GCC and glibc caught tar 1.22 and 1.23 (already released at the time) running a strncpy() between two adjacent buffers. It was a known problem from before, but with the latest (at the time) toolchain updates, it became a failure situation.
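I haven’t gone digging through tar’s actual source here, but the pattern the fortified toolchain catches looks, in miniature, like this (a made-up reproduction, not tar’s code):

```c
#include <string.h>

/* Two adjacent fixed-size fields, like the name fields in a header. */
struct header {
	char name[8];
	char prefix[8];
};

void set_name(struct header *h, const char *longname)
{
	/* The copy is sized for both fields, counting on the excess
	 * spilling over into the adjacent prefix member. Built with
	 * -D_FORTIFY_SOURCE=2, glibc knows the destination object is
	 * only 8 bytes and aborts the program at runtime instead. */
	strncpy(h->name, longname, sizeof(struct header));
}
```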

You’d expect that version 1.24, released last week, would have applied the trivial fix needed. The answer is a resounding “nope”. D’oh!

Okay, so whatever, they didn’t fix something, but it’s not the end of the world, is it? I’m still keeping around patches for Linux-PAM that I wrote years ago. Sure, it’s a bit worse here because it’s the GNU project’s utilities refusing to update for changes in the GNU project’s own toolchain, but it’s nothing extreme.

What makes it much worse is that this release actually introduces not one but two bugs that hit us very hard; one relates to the very important -C switch, the other seems to relate to the KDE tarballs (which are most likely created with an older version of GNU tar itself). Somehow, it feels like the project could use a much more extensive testsuite.

Considering this and a number of other issues, I’m sincerely wondering whether I shouldn’t work again on making it feasible for users to choose bsdtar as the default implementation (while still allowing side-by-side installation, since I’m sure there will be stuff not working properly with bsdtar at first), and set that up on the tinderbox as well. One of the nicest parts of it is that libarchive allows for in-process decompression and extraction (which GNU tar can’t do), and it would allow replacing two packages (GNU tar and cpio) with a single one that is already required by both GNOME and KDE.
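To give an idea of why that matters, this is roughly what extracting a compressed tarball looks like through libarchive, with the decompression happening in-process instead of through a piped gzip or bzip2 (a minimal sketch, error handling mostly omitted):

```c
#include <archive.h>
#include <archive_entry.h>

int extract(const char *path)
{
	struct archive *a = archive_read_new();
	struct archive_entry *entry;

	/* enable every decompressor and format that was compiled in;
	 * newer libarchive spells the first of these
	 * archive_read_support_filter_all() */
	archive_read_support_compression_all(a);
	archive_read_support_format_all(a);

	if (archive_read_open_filename(a, path, 10240) != ARCHIVE_OK)
		return -1;

	while (archive_read_next_header(a, &entry) == ARCHIVE_OK)
		archive_read_extract(a, entry,
		                     ARCHIVE_EXTRACT_TIME | ARCHIVE_EXTRACT_PERM);

	archive_read_finish(a); /* archive_read_free() in newer versions */
	return 0;
}
```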

Refreshing Gentoo Work

After a few months spent mostly working on lscube, I’ve been ignoring most of the non-basic Gentoo work for a while. Between last night (before going to sleep) and this morning, though, I started the catch-up work.

First of all, Tim released a new version of libarchive that required some testsuite fixing (and I hadn’t noticed the first time around that it now wrongly uses -Werror, since I have -Wno-error in my CFLAGS to avoid wasting time). Thankfully, Tim is a dream upstream to work with, and the most important fix is already upstreamed.

Then I have been active in the Ruby area, since I needed to work on the new Typo, and a few more packages are bundled with Typo’s code (you’re going to find a git branch with no third-party code bundled in my git repositories when I’m done); I also got some new tasks to work on.

The gems problem, which is hopefully going to be solved after the Summer of Code, is for now just being sidestepped; indeed, I’ve ended up adding the will_paginate library with a fake gemspec, which actually works pretty nicely, without the usual side effects of gems (no object files installed, no extra dependency on rake, no installed testsuite) and with the obvious advantages of the tarballs, including working (and actually run) testsuites, documentation built on request and installed, as well as examples. This package, and probably a few more before the end of the month, will be tested directly here on the blog, if you’re interested in the outcome.

I still have a few things that I was supposed to get done in the past month, among which updating calibre (I’ve been using an old version on OS X up to now), figuring out why libcdio-0.81 freezes during install, and stuff like that. Hopefully I’ll also be able to find time for those, now that my job is a bit more secure than it was before.

When upstream lacks a testsuite

… you can make your own, or at least try.

I maintain in Portage a little ebuild for uif2iso; as you probably already know, the foo2iso tools are used to convert various types of proprietary disk images, produced by proprietary Windows software, into ISO9660 images that can be used under Linux. Quite obviously, unit testing such a tool is pointless, but regression testing at the tool level might actually work. Unfortunately, for obvious reasons, upstream does not ship testing data.

Not exactly happy with this, I started considering what solutions I had, and came to my decision: if upstream does not ship a testsuite, I’ll make one myself. The good thing with ebuilds is that you can write whatever you want for the test in src_test. I finally decided to build a UIF image using MagicISO on my Windows XP vbox, download it, together with the MD5 digests of the files I put in it, conditionally on the test USE flag, and, during the test phase, convert it to ISO, extract the files, and check that the MD5 digests are correct.

Easier said than done.

To start with, I had some problems deciding what to put on the image; of course I could have used some random data, but I thought that, at that point, I could at least make it fun for people to download the test data, if they wanted to look at it. My choice fell on finding some Creative Commons-licensed music and using a track from that. After some looking around, I went with the first track of Break Rise Blowing by Countdown on Jamendo.

Now, the first track is not too big, so downloading the test data is not a significant overhead, but there is another point here: MagicISO offers three algorithm settings: default, best compression, and best speed; most likely they are three compression levels of LZMA or something along those lines, but just to be safe I put all three of them to the test. The resulting file with the three UIF images and the MD5 checksums was less than 9MB, an acceptable size.

At that point I started writing the small testsuite, and the problems started: uif2iso always returns 1 at exit, which means you can’t use || die, or it would always die. Okay, good enough: just check that the file was created. Then you have to take the files out; nothing is easier when you’ve got libarchive, which can extract ISO files as if they were tarballs: just add that as a dependency when the test USE flag is enabled. A bit of overhead, but at least I can easily extract the data to test.

It seems, though, that the ISO file produced by uif2iso is going to be a test for libarchive instead, since the latest release fails to extract it. I mailed Tim, and I hope he can fix it up for the next release (Tim is fantastic with this: when 2.5.902a was released, I ended up finding a crasher on a Portage-generated binpkg; I just had to mail it to him, and in the next release it was fixed!). The ISO file itself seems fine, since loop-mounting it works just fine. The problem is that I know of no other tool that can extract ISO images quickly and without being commanded file by file (iso-read from libcdio can do it, it’s just too tedious); if somebody has suggestions, I’m open to them.

This is the fun that comes out of writing your own test cases, I guess; on the other hand, I think it’s probably a good idea to keep the problematic archives around, if they have no licensing problems (Gentoo binpkgs might, since they are built from sources and you’d have to distribute the sources with the binaries, which is why I wanted some Creative Commons-licensed content for the images), since that allows you to test stuff that broke before and ensure it never breaks again. Which is probably the best part of unit, integration and system testing: you take a bug that was introduced in the past, fix it, and write a test so that, if it is ever reintroduced, it gets caught by the tests rather than by the users again.

Has anybody said FATE?

Importing someone else’s code

I already maintain, in the tree, an ebuild for the Nimbus GTK+ engine, which is used by Sun Microsystems on OpenSolaris and Solaris Express. I like it: it’s quite nice, and it was a pleasure to package, as it worked almost out of the box (I had to patch a few things, but it’s fine now).

Today I looked at the JDS directory where Sun keeps the sources of the pieces of code they use in their modified GNOME environment, with the sole purpose of finding something that might be useful to import into Gentoo.

There are a few things I could probably package in my overlay for the sake of it (Sun’s backgrounds and GDM themes — we have Red Hat’s and Mandrake’s artwork packages, after all), but my attention was caught by an “ospm” package, which seems to provide an interface to the status of printers.

One thing I love about Free Software is that you can usually make use of software that somebody else developed for a different target without problems. Unfortunately, this was not the case. While I hoped for something usable under Linux, ospm itself has a few Solaris dependencies too many (that’s why it’s called the OpenSolaris Printing Manager, I suppose).

Anyway, this made me think of a couple more issues. I have said before that I’d like an eselect tool that could switch between different implementations of a tool on a Gentoo system without having to write one module for each, for instance for tar (so that one could choose between libarchive’s bsdtar and GNU tar), or for whois clients.

Debian already has a piece of software called “alternatives” that does that. I have wondered before whether it’s possible to import it into Gentoo and make use of it, rather than reinventing the wheel again… but as far as I know it’s tightly tied to Debian’s dpkg (just like eselect is unlikely to ever be used outside Gentoo).

More interesting, instead, is the fact that Cygwin ships with an “alternatives” package that is not based on the one from Debian, but rather on the one in Fedora (the chkconfig package). We already import a few things from Fedora and Red Hat, and I know Donnie packaged a few more things from them before. I really should try it out and see if it works. It certainly makes sense not to reinvent the wheel once again if somebody already did the work, even if it breaks the “Gentoo way” of using eselect (though one could implement an eselect plugin over alternatives, of course).

So, to keep this short and leave time to do some work: I’ll certainly look into whether, rather than having to invent something new, we can reuse something that fellow developers have already developed and tried.

Back on Enterprise

So I’m finally back on Enterprise; the new office is almost done, and I just have to finish cleaning my stuff to bring it back into the office.

Since I’m back on Enterprise, I started Specto and looked at which pages were updated, and I found quite a few new releases for the packages I maintain:

  • Linux-PAM released the first of the 1.0 series, 1.0.0, which I’ve already added to the tree;
  • libarchive and sudo updated both their stable and beta branches;
  • nxhtml got a new release.

The release of Linux-PAM 1.0.0 makes me wonder whether I should try to complete the move to sys-auth that started in fall 2005. I left PAM out before, hoping that epkgmove would improve, but I don’t see that happening anytime soon, so I’m pondering doing the move manually myself. Linux-PAM should really become sys-auth/linux-pam rather than the sys-libs/pam it is now.

The biggest problems I can see are overlays that refer to sys-libs/pam, which would break just about immediately after the move, and overlays that for some (often stupid) reason carry a sys-libs/pam package of their own, which would need to be moved as well.

Oh well, just another entry in my TODO list I suppose.

Summer of Code ideas for other projects

I know I already filled the Gentoo SoC project page with ideas, but I still have a few more to propose for organisations that I’m not even sure will take part in SoC at all. Think of this post as a braindump of stuff I’d like from other projects and which I’d consider well suited for Summer of Code.

  • for lighttpd, a PAM-based authentication module, so that, for instance, I could let all the xine developers who access the server hosting the xine Bugzilla also access a private HTTP directory on it, with a single user and password database (the system’s);
  • for libarchive (FreeBSD), built-in support for the LZMA (de)compression algorithm, so that it could handle GNU’s .tar.lzma files on its own;
  • for glib, a confuse-like configuration file parser, so that I could drop that dependency from unieject.

Checking for page changes

Dear lazyweb, I’m asking for help. I maintain the ebuilds for a series of packages that don’t seem to use common release-notice services like Freshmeat. This is the case, for instance, of libarchive, the Nimbus theme, sudo, and others.

I usually check the pages every other day, but it’s starting to get boring, and it’s something I’d rather automate. Does anybody know of a nice piece of software that checks whether pages have changed and sends me an email, or a log, or an RSS feed, or anything really, so that I can just point it at some given pages and leave it alone?

Optimally, it should just check the headers for the Last-Modified date, or for a changed ETag, without doing a full GET of the page. I could probably write something in bash to do it, since it’s just a matter of netcat and grep, or maybe curl, but I’d rather avoid writing it myself if something already exists.

The reason why I would like it to use HEAD rather than GET is that there is no point in a script requesting the whole page every day just to check whether it changed; it’s not a browser that needs to render it. This way it would save both my bandwidth and the site’s, which, if done properly by all such services, would add up to quite a saving. Even better if it could use If-Modified-Since, so that after the first request every subsequent one would just get a 304 response, with no extra data like content type and length, which on the server side requires little more than a stat().
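If I did end up writing it myself, the core of it in C with libcurl would look something like this sketch (the function name is mine, and it assumes the server actually honours conditional requests):

```c
#include <curl/curl.h>
#include <time.h>

/* Returns 1 if the page changed since last_seen, 0 if not, -1 on error. */
int page_changed(const char *url, time_t last_seen)
{
	CURL *h = curl_easy_init();
	long code = 0;

	if (!h)
		return -1;

	curl_easy_setopt(h, CURLOPT_URL, url);
	curl_easy_setopt(h, CURLOPT_NOBODY, 1L);          /* HEAD, not GET */
	curl_easy_setopt(h, CURLOPT_TIMECONDITION,
	                 (long)CURL_TIMECOND_IFMODSINCE); /* If-Modified-Since */
	curl_easy_setopt(h, CURLOPT_TIMEVALUE, (long)last_seen);

	if (curl_easy_perform(h) != CURLE_OK) {
		curl_easy_cleanup(h);
		return -1;
	}

	curl_easy_getinfo(h, CURLINFO_RESPONSE_CODE, &code);
	curl_easy_cleanup(h);

	return code == 304 ? 0 : 1;
}
```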

If anybody has a suggestion, it’s very welcome. Even if it’s not in Portage, I can create ebuilds and maintain them afterwards; I just need something to make my job easier! Of course, it goes without saying that it has to run on Linux (and possibly FreeBSD ;) ).

A good kind of cow

After all I wrote before, I’m sure at least some people might think that every COW is a bad COW, and that you should never use COW sections in your programs and libraries.

It’s not exactly like that. There are times when using a copy-on-write section like .bss is a good choice over the alternatives.

This is true, for instance, for buffers. There are mainly three ways to handle them: buffers allocated with malloc(), automatic buffers allocated on the stack, and static buffers that end up in .bss.

A buffer allocated on the stack has the main advantage of not having to be explicitly free()’d, but big buffers on the stack, especially if not well guarded, might cause security issues. Plus, they might require a big stack.

A buffer allocated on the heap through malloc() is more flexible, as you only request memory as needed, and free it as soon as it’s not needed anymore (with stack-based buffers, you need to wait for the end of the block, or create a block around just the instructions that use the buffer). This reduces the memory footprint over time, but it has a little overhead in the malloc() and free() calls.

Another option is to use static arrays as buffers. Non-initialised static arrays are put into .bss, which is a copy-on-write section usually backed by the zero page (although I’m not yet sure how the zero-page changes in Linux 2.6.24 affect this statement). The good thing about static buffers is that you don’t need to manage their lifetime, either explicitly or implicitly, as they are allocated for the whole run of the program.
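Put into code, the three options look like this (a trivial sketch just to show where each buffer lives):

```c
#include <stdlib.h>

#define BUFSIZE 8192

static char static_buf[BUFSIZE]; /* in .bss: lives as long as the program,
                                    costs nothing until first written to */

void use_stack(void)
{
	char buf[BUFSIZE]; /* on the stack: gone with the frame, but big
	                      sizes risk blowing the stack */
	buf[0] = '\0';
	/* ... use buf ... */
}

void use_heap(void)
{
	char *buf = malloc(BUFSIZE); /* on the heap: flexible lifetime, at
	                                the cost of malloc()/free() calls */
	if (buf == NULL)
		return;
	/* ... use buf ... */
	free(buf);
}
```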

This is not good for libraries, as you might have a static buffer in .bss that is never used, yet still takes up memory once other, used, .bss variables on the same page are modified and the page is copied on write. Things look better for simple fire-off programs, which start and terminate quickly (non-interactive programs).

It’s also important to note that, as a good design principle, libraries should always be designed to work in multi-threaded software, and static variables and arrays are not of much use there, unless every access is guarded by a mutex (which reduces performance). For this reason, .bss is a bad thing for libraries in almost all cases.

For fire-off programs, as I said, this is less of a problem, as the buffer might just be used a couple of times during the life of the program, and, if it’s reasonably sized, it might not even impact the program’s overall memory usage (even a single static variable, once changed, costs you a whole 4KiB memory page, so adding a 100-byte variable will not change anything; it will change if you use a 4KiB, or bigger, static buffer).

So sometimes you just have to give up: the static buffer might be improving the performance of the program, so it just has to stay there. This is why I don’t really fight .bss too much. The only thing that I don’t think should ever go into .bss are tables: calculating them at runtime is useful only for single-task embedded systems, so there has to be a way to opt out of that, by using hardcoded tables computed beforehand or right at build time.

Another good use of .bss is when the memory would be allocated at the start of the program anyway and then freed only at exit. This is often the case in xine-ui, for instance, where there are big structures holding all the state information for a certain feature. That data cannot be shared between instances, and has such a long life that it’s just easier to allocate it all at once, rather than using up heap or stack for it. In xine-ui, especially, some structures were accessed through a .bss pointer, which was set to an allocated memory area once the feature was activated, and freed either when the feature was temporarily not used anymore, or when the program exited; while you don’t always need the stream information, trying to save a few KiB of memory by using heap memory might not be a good idea if you have to access the data through a pointer, rather than having the structure in .bss and skipping the indirection.
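The two alternatives look more or less like this (hypothetical names and sizes; the real xine-ui structures are obviously more complex):

```c
#include <stdlib.h>

struct feature_state {
	char buffer[64 * 1024];
	int active;
};

/* Alternative 1: only a pointer lives in .bss; the memory is allocated
 * when the feature is turned on and freed when it's turned off, but
 * every access pays for the pointer indirection. */
static struct feature_state *state_p;

void feature_on(void)  { state_p = calloc(1, sizeof(*state_p)); }
void feature_off(void) { free(state_p); state_p = NULL; }

/* Alternative 2: the whole structure lives in .bss; no lifetime
 * management and direct access, and the pages stay shared with the
 * zero page until something is actually written to them. */
static struct feature_state state;
```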

So, for this reason, while I’d be happy if we could find ways to avoid using COW sections at all and still be fast, I’m not aiming at moving all the numbers to zero; I just want to make sure that there aren’t memory areas where the space is simply wasted.

On the other hand, there are cases which show that something in the design of a program might just be way outside the sane league: for instance, the 10MB of .bss used by the quota support in xfsprogs is tremendously suspicious.

New patches! Just for our Gentoo users – and for developers of other distributions who want to take the patches ;) – I’ve added a few more patches to my overlay: giflib’s patch is now in sync with what upstream applied, moving a constant back to a variable as needed, while app-arch/libarchive and sys-apps/file each got a patch to reduce their COW memory usage. Both got good results from using character arrays inside structures.
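For reference, the kind of change involved looks like this (a made-up example, not the actual patches): a table of string pointers needs one relocation per entry, so even a const table gets copied on write when the dynamic loader fills in the pointers, while embedding fixed-size character arrays keeps the table truly constant:

```c
/* Before: every entry is a pointer the dynamic loader has to relocate,
 * dirtying the pages that contain the table. */
static const char *const names_ptr[] = { "tar", "cpio", "iso9660" };

/* After: no pointers, no relocations; the table lands in .rodata and
 * stays shared clean between all processes using the library. */
static const char names_arr[][8] = { "tar", "cpio", "iso9660" };
```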

Package moves

This is just a service notice for all Gentoo (and in particular Gentoo/FreeBSD) users: app-arch/bsdtar is no more; it has been moved to app-arch/libarchive. This is because of the plans found in the FreeBSD Quarterly Status Report for libarchive: Tim is working on a cpio implementation using libarchive, and since I suppose he’ll put it directly in libarchive, keeping the package named bsdtar would be a mistake.

As a vulnerability was present in previous bsdtar/libarchive versions, and thus a GLSA is going to be issued soon, DerCorny suggested making the move now rather than in the near future, so that the GLSA would remain valid. So there you go, the move has been done.

Now, there is but one problem with libarchive 2.2: it fails to extract some tarballs that have strange groups and setgid settings. This is the case for both the slashem and git’s git-manpage tarballs in Portage. I discussed this with Tim some time ago, but he’s quite busy, so he hasn’t worked on it yet; yesterday, while looking around a bit because of the vulnerability, I found myself thinking of a possible solution. In my overlay you can find a 2.2.4-r1 candidate, although the patch makes the testsuite fail (I’m still working on it).

It shouldn’t be much of a problem to fix it in the next few days, and it will certainly be a welcome break from the PAM work :)