ModSecurity, changing times

You probably remember my recent rant about Debian’s ModSecurity packaging, which started with me trying to get my ruleset to work on VideoLAN to help them fight back the spam. Well, thanks to the guys behind the ModSecurity twitter account I was able to get in touch with the Debian maintainer (Alberto), and it now looks like the story will have a happy ending.

Alberto is working on a similar split between the ModSecurity module and the Core Rule Set configuration files, so that they can be managed with the Debian package manager, just like they can already be managed with Portage. And to make it easier to administer both distributions, I’ve decided to make a few changes to the Gentoo ebuilds so that the installed layouts of the two differ as little as possible.

The first change relates to the internal name of the package; while I haven’t decided on a package move yet, mod_security is a Gentooish spelling: the package is actually called ModSecurity upstream and the tarball is named modsecurity-apache; you can already see that the CRS is modsecurity-crs. Configuration files and storage directories now also use modsecurity — I’ll see when I feel like renaming the package altogether to www-apache/modsecurity.

The second change relates to the way the rule configuration files are installed; up to now the rules were installed in a subdirectory of the Apache configuration tree; this is not suitable for Debian, and even in Gentoo it looked awkward — the new directory for the ModSecurity CRS rules is /etc/modsecurity. Furthermore, what once was modsecurity_crs_10_config.conf is now /etc/apache2/modules.d/80_modsecurity-crs.conf, and it carries the include masks for the rest of the rules. This will allow the ebuild to enable/disable rules depending on USE flags in the future.
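
To give an idea of the new layout, here’s a sketch of the loader file; the exact content shipped by the ebuild may differ, and the wildcard mask and the `SECURITY` define are assumptions of mine:

```apache
# /etc/apache2/modules.d/80_modsecurity-crs.conf (sketch)
# Pull the CRS in from its new, Apache-independent home; individual
# rule files can later be included or skipped based on USE flags.
<IfDefine SECURITY>
    Include /etc/modsecurity/*.conf
</IfDefine>
```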

And to make it as easy to deal with as possible, I’ve now added a geoip USE flag to mod_security — which does nothing more than add dev-libs/geoip to its runtime dependencies and set the configuration file to use the database installed by that ebuild. The reason for having this dependency is twofold: on one side, declaring the dependency helps make sure that the database is installed and kept updated by Portage; on the other, if you already have a license to use MaxMind’s GeoIP databases, the package provides you with all the updater scripts you need to get the updated data from MaxMind.
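
For reference, wiring the database in boils down to something like the following; SecGeoLookupDb and the @geoLookup operator are real ModSecurity 2.x features, but the exact database path and the sample rule are illustrative assumptions of mine:

```apache
# Point ModSecurity at the database installed by dev-libs/geoip
SecGeoLookupDb /usr/share/GeoIP/GeoIPCity.dat

# Illustrative chained rule: resolve the client address into the GEO
# collection, then act on the country code ("XX" is a placeholder)
SecRule REMOTE_ADDR "@geoLookup" "chain,deny,status:403"
SecRule GEO:COUNTRY_CODE "@streq XX"
```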

A little digression about GeoIP: I think it might be a good idea to consider changing the GeoIP ebuild and having instead a virtual that provides the database, either in the form of the updater scripts to get the paid versions, or of GeoLite packages that can be updated regularly. Unfortunately I don’t have the time to follow something like this for now.

Going back to my personal favourite subject on the ModSecurity topic, my ruleset has gained a number of fake-browser pattern matches with a fairly low risk of false positives – thanks to the testing that you helped me with – and should now filter almost any possible spam you’re going to receive. I’m now updating the documentation to provide examples of how to debug the rules themselves; in the next few days I might try to get some extra time to tag all the rules so that they can be disabled in bulk when the new ModSecurity 2.6 is released.

Don’t forget to recommend my ruleset, report problems and … flattr it!

ModSecurity and Debian, let the challenge begin

Some of you might have already read about my personal ruleset that I developed to protect my blog from the tons of spam comments it receives daily. It is a set of configuration files for ModSecurity for Apache that deny access to my websites to crawlers, spammers and other malicious clients.

I was talking with Jean-Baptiste of VLC fame over the past two days about using the same ruleset to protect their wiki, which has even worse spam problems than my blog. Judging from the logs j-b has shown me, my rules already cover most of the requests he’s seeing (which is a very positive note for my ruleset); on the other hand, configuring their web host to properly make use of them is proving quite tricky.

In Gentoo, when you install ModSecurity you get both the Apache module, with its basic configuration, and a separate package with the Core Rule Set (CRS). This split is an idea of mine to solve the problem of updating the rules, which are sometimes updated even when the code itself is unchanged — that’s the whole point of making the rules independent of the engine. With the split package layout, the updater script that is designed to be used together with ModSecurity is not useful on Gentoo, so it’s not even installed — even though it is supposedly flexible enough that I could make it usable with my ruleset as well.

In Debian, though, the situation is quite a bit more complex. First of all there is no configuration installed with the libapache-mod-security package, which only installs the module itself and the file to load it. At a minimum, for ModSecurity to work you have to configure the SecDataDir directive, and then give it a set of rules to use. The CRS files, including the basic configuration files, are installed by the Debian packages as part of the documentation, in /usr/share/doc/mod-security-common/examples/rules/.
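
For comparison, the bare minimum a Debian admin ends up writing by hand looks something like this (a hedged sketch: the directive names are ModSecurity 2.x, the paths are the ones from the text above, and the data directory is my guess):

```apache
SecRuleEngine On
# Persistent collections need a writable directory
SecDataDir /var/cache/modsecurity
# The rules Debian ships as "documentation"
Include /usr/share/doc/mod-security-common/examples/rules/*.conf
```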

I’ve now improved the code to provide an init configuration file that can be used without the CRS… but it seriously makes me wonder how Debian admins can deal with ModSecurity at all.

Finally, a consideration: the next version of ModSecurity will have support for looking posted URLs up in the Google Safe Browsing database, which is very good as an antispam measure… I have hopes that either the next release or the one after will also bring Project Honey Pot http:BL support, given that the dedicated Apache module was totally messed up and unusable. That would make it a sweet tool to block crawlers and spammers!

Why Ruby ebuilds are not autogenerated

From time to time somebody drops in on #gentoo-ruby, or mails the Gentoo Ruby team alias, either asking about, proposing to help with, or outright attacking us for not having a tool that automatically generates ebuild files for gems. Call it or design it however you prefer, but in general the whole question boils down to “why do we think ebuilds should not be autogenerated?”

At first glance, the Gem specification (gemspec) provides information the same way an ebuild does, for the most part, so why shouldn’t it be feasible to produce an ebuild out of it? Well, there are many reasons, but the main one is that gemspecs don’t provide all the information we need, followed closely by the fact that gems don’t mandate enough of a standardised interface.

These two problems find their roots in the RubyGems approach of creating a format and managing it as a single piece of code. As I have written many times, the RubyGems project should have split its concerns and handled them independently rather than as a single, monolithic black box:

  • discovery of the installed extensions is something RubyGems handles quite well, most of the time at least; as a standard for install locations and requests for particular libraries, it is a method good enough that in Gentoo we have no intention whatsoever of replacing it; indeed the RubyNG eclasses install in the same location, and provide a compatible registration of the package; the problem here is that the registration is not independent from the way the code is installed, but is a gemspec by itself;
  • the package distribution format used by gems is problematic as we said before; some older gems (luckily I haven’t seen one in a long time) used to provide base64-encoded archives rather than the double-tarball they come in right now, and even the double-tarball is a bit nasty to deal with; on the other hand I can see why it was designed that way and it’s not all bad;

  • what I think really sucks is the gem package manager, which is the issue that both Debian and Gentoo have with the whole system: it does not comply with the usual standards for distributions, and it shows that it was designed for platforms where no package manager is available (namely Windows and OS X).

Because these three objectives haven’t been separated, only the strict internal requirements between the three of them have been taken into account. If the package distribution format had instead been designed in consultation with the developers of distributions’ package managers, it would most likely have taken quite a different direction, providing the metadata we currently lack, to begin with.

For instance, the specifications don’t tell us which license the gem is released under, they don’t tell us which dependencies are mandatory and which are optional, and they often don’t tell us the correct test dependencies either.
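
A hypothetical gemspec makes the gap obvious; every name below is made up for illustration, and this is essentially all the metadata a distribution packager can count on:

```ruby
# A hypothetical gemspec -- names are made up for illustration.
require 'rubygems'

spec = Gem::Specification.new do |s|
  s.name    = 'example-gem'
  s.version = '1.0.0'
  s.summary = 'An example Ruby extension'
  s.authors = ['Somebody']

  # Dependencies are a flat list: nothing marks a runtime dependency as
  # optional, and nothing forces a license or test interface declaration.
  s.add_runtime_dependency     'rake'
  s.add_development_dependency 'rspec'
end

# All a packager can recover is the name/type/requirement of each entry
spec.dependencies.each do |dep|
  puts "#{dep.name} (#{dep.type})"
end
```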

But most importantly, the whole RubyGems infrastructure mandates no interface. How do you run tests? Most of the time, you can just run “rake test” and it works; good, but is it always the case? Of course not. Even if a small set of developers do create such aliases when they use rspec or other testing software, there is no mandatory interface for testing, and that is quite a problem.
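
The convention, where it is followed at all, is just a Rakefile along these lines (a sketch; both the task name and the test/ layout are merely customary, which is exactly the problem):

```ruby
# Rakefile -- the customary, but in no way mandated, testing entry point.
require 'rake/testtask'

Rake::TestTask.new(:test) do |t|
  t.pattern = 'test/**/*_test.rb'  # the layout itself is only a convention
end
```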

Again, as I said above, if the three goals had been split properly, it would have been quite easy to leverage the abilities of the various package managers through the gemspec format: all distributions want to know the license of a package, and most will want a mandatory interface for build, test and install. If we had gems as package-manager-agnostic packaging for Ruby extensions, everybody would win: distributions would use autogenerated packages for their own repositories, and the RubyGems manager would add integrated external packages for the stuff that was missing.

Unfortunately, life is nowhere near as easy. But since I’m currently gathering more knowledge about the common tasks of Ruby extensions, I’ll probably resume my “Ruby Packaging Specification” project and try to give pointers on how to achieve a more standardised interface for gems so that, slowly maybe, we can reach the goal of having gems play nice with our package managers.

RTSP clients’ special hell

This week, in Orvieto, Italy, there was OOoCon 2009, and the lscube team (also known as “the rest of the feng developers beside me”) was there to handle the live audio/video streaming.

During the preparations, Luca called me one morning, complaining that the new RTSP parser in feng (which I wrote almost single-handedly) refused to play nice with the VLC version shipped with Ubuntu 9.04. The problem was tracked down to the parser for the Range header, in particular the parsing of the normal play time value: the RFC states that a decimal value with a dot (.) as the separator is to be expected, but VLC was sending a comma (,), which my parser refused.

Given that Luca actually woke me up while I was in bed, it took a strange presence of mind to ask him which language (locale) the system was set to: Italian. Telling him to try the C locale was enough to get VLC to comply with the protocol. The problem here is that the separators for decimal places and thousands are locale-dependent characters; most programming languages obviously limit themselves to supporting the dot, and a lot of software likewise uses the dot no matter what the locale is (for instance, right now I have Transmission open and the download/upload stats use the dot, even though my system is configured in Italian). Funny that this problem came up during an OpenOffice event, given that it’s definitely one of the best-known pieces of software that actually relies on (and sometimes messes up) that difference.

To be precise, though, the problem here is not with VLC by itself: the problem is with the live555 library (badly named media-plugins/live in Gentoo), which provides the generic RTSP code for VLC (and MPlayer). If you ever wrote software that dealt with float-to-string conversion you probably know that the standard printf()-like interface does not respect locale settings; but live555 is a C++ library, and it probably uses string streams.

At any rate, the bug was known and already fixed in live555, which is what Gentoo already has, as do the bundled contributed libraries of VLC (for the Windows and OS X builds), so those three VLC instances are just fine; but the problem is still present in both the Debian and Ubuntu versions of the package, which are quite outdated (as xtophe confirmed). Since the RFC has no conflicting use of the comma in that particular place, and given the spread of the broken package (Ubuntu 9.10 also has the same problem), we decided to work around it inside the feng parser, accepting the comma-separated decimal value as well.
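
The workaround itself is trivial; feng’s real parser is C, but the idea can be sketched in a few lines (parse_npt is a hypothetical name):

```ruby
# Accept both the RFC 2326 dot and the locale-broken comma as the
# decimal separator of a normal play time value.
def parse_npt(value)
  Float(value.tr(',', '.'))
end

puts parse_npt('12.5')  # what the RFC mandates
puts parse_npt('12,5')  # what a locale-confused client sends; same value
```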

From this situation, I also ended up comparing the various RTSP clients that we are trying to work with, and the results are quite mixed, which is somewhat worrisome to me:

  • latest VLC builds for proprietary operating systems work fine (Windows and OS X);
  • VLC as compiled in Gentoo also works fine, thanks Alexis!
  • VLC as packaged for Debian (and Ubuntu) uses a very old live555 library; the problem described here is now worked around, but I’m pretty sure it’s not the only one that we’re going to hit in the future, so it’s not a good thing that the Debian live555 packaging is so old;
  • VLC as packaged in Fedora fails in many different ways: it loops for about 15 minutes saying that it cannot identify the host’s IP address, then it finally seems to get a clue, so it’s able to request the connection but… it starts dropping frames, saying that it cannot decode and things like that (I’m connected over gigabit LAN);
  • Apple’s QuickTime X is somewhat strange: on Merrimac, since I used it to test the HTTP tunnel implementation, it now only tries connecting to feng via HTTP rather than using RTSP; this works fine with the branch that implements the tunnel but obviously fails badly in master (and it doesn’t look like QuickTime takes the hint to switch to the RTSP protocol); on the other hand it works fine on the laptop (which has never used the tunnel in the first place), where it uses RTSP properly;
  • again Apple’s QuickTime, this time on Windows, seems to be working fine.

I’m probably going to have to check the VLC/live packaging of other distributions to see how many workarounds for broken stuff we might have to look out for. Which means more and more virtual machines; at this pace I’ll probably have to get one more hard drive (or I could replace one 320G drive with a 500G drive that I still have at home…). And I should try Totem as well.

Definitely, RTSP clients are a hell of a thing to test.

Signatures, security, attack vectors

In the past weeks we have witnessed lots of concern in the free software world about the problems tied to the simplification of attacks on the SHA-1 algorithm. Unfortunately this type of news, while pretty important, is only meaningful to people who actually are experts in cryptography, and can confuse the hell out of people, such as me, who don’t know that much about the topic.

So while there are posts about the practical attacks on git through the SHA-1 weakness which may seem far-fetched to some users, I’d like to try to understand, and to explain, what the real-world implications of using weak hash algorithms are in many situations.

The first problem that comes to my mind is very likely social: we call them “unique identifiers” even though there is nothing in the math, as far as I can see, that makes them unique, and “one-way hashes” even though you can obviously reverse them with a properly sized table. What good hashes are designed for is making sure that the chances of a collision are low enough that it’s infeasible to hit one, and that the tables needed for reversing the hash are huge enough that normal computers can’t handle them.

Of course, the concepts of “infeasible” and “huge” in computing are quite vague: while something may very well be infeasible for my cellphone, it might not be for the iMac; it might be huge for Merrimac but not for Yamato. What was absolutely impossible for personal computers five years ago might very well be a cakewalk for a PlayStation 3 (think about the December 2008 SSL certificate attack). And this is without considering clusters and supercomputers; which means that we have to take a lateral approach to all this rather than just following the God Hash Function.

Hash functions are not just used for cryptographic work, of course; sometimes a hash is just a redundancy check to ensure that data arrived properly — for instance, the TCP protocol still checksums its packets, and the collision-free space of a CRC is far smaller than that of MD5, which itself we know is no longer a valid security solution. In cases where you only need to tell whether some data arrived or was decoded properly, cryptographic hashes are not very important at all, and speed is more of a concern. There, CRC32 still performs pretty neatly.

On a similar note, I still can’t understand why FLAC needs to store the MD5 hash of the uncompressed wave data; sure it’s very desirable to have a checksum to tell whether the decode was done properly, but I’d expect a CRC32 checksum to be quite enough for that — I don’t see the need to go in with the big guns…

Anyway, this moves on to the next point: having a checksum, a hash, a digest for a file that has to be downloaded is useful to know whether the download completed successfully or whether there were problems during the transmission. Is MD5 enough there? Well, it really depends: if it’s just to verify data that is not tremendously important (because it’s not going to execute code on the machine, like a photo or a video), then it might well be enough; sometimes CRC32 is also quite enough (for instance, if you’re an anime fan you have probably noticed quite a few times a CRC32 checksum in the downloaded episodes’ file names – of course downloading pirated anime is illegal, just remember that next time you do it…).

But is it the same for source code? Why doesn’t Gentoo use MD5 any longer? Why are we using both SHA-256 and RMD-160? Obviously it’s not the same for source code, and while using more resilient hash algorithms (I was going to say “stronger” but that’s not the point, of course) is necessary, it is by far not sufficient. With source and executable code, we don’t only want to ensure that the data was received correctly, but also that the data is what we want it to be. This means that we need to certify that the downloaded data corresponds to what was tested and found safe.

For this step we have to introduce a concept different from the mere hash: the signature; we need to sign the data to make sure that it’s not changed, and so that if it’s tampered with, the signature no longer matches. GnuPG signatures are meant to do just that, but they also rely on a hash algorithm, which nowadays tends to be SHA-1, unless, like the Debian developers, you change it to SHA-256 or something else. Does it make such a difference? It depends on what you use the key for, one should say.

There are two main critiques against the use of different hashing algorithms for GnuPG key generation by default. The first: the GnuPG maintainer said that the (economic) resources needed to counterfeit a signature nowadays are high enough that it would still be cheaper for somebody to just pay a very bad guy to come to you. With a gun. Or worse. The second is that to forge a signature on an email message, you’re going to need to add lots and lots of spurious data, which would be quite a giveaway of the way the message was produced.

Of course these points are both true; but there is one catch for each. The former is true for now but is not going to remain true forever: not only can more weaknesses in the algorithm be found, but the average computing power available to a single individual is still increasing year after year; while 200 PS3 systems don’t come cheap nowadays, they certainly are more feasible, and less risky, to procure than a serial killer. And they are much lower profile.

The latter point is more interesting, because it shows some limits to the ability to forge a duplicate key or counterfeit a signed message. Indeed, whatever the algorithm used, a simple signed text message, once counterfeited, is going to be easily spotted thanks to the presence of data that is bogus or not relevant to the message. While the technical chance of producing a counterfeited message that only contains words in the correct language — and is thus easy to blend with the rest of the message — is not null, it’s also quite far-fetched nowadays, even for CRC, I’d say. That should be enough for email messages.

But is it for every application of GnuPG keys? I don’t think so; as you might have read in the post I linked earlier in this entry about the chances of using the SHA-1 attacks to fool the git content tracker, it is possible to replace source code even while introducing bogus data, because almost nobody will go through all the source files to see if there is something strange in them. Similarly, spoofing signatures on binary files is not as hard as spoofing signatures on email messages. Even more so when you consider that bzip2, gzip, and lzma all ignore trailing unknown data in their archives (a behaviour Gentoo actually uses as a feature for the binary packages Portage builds). This means that keys used for signing source and binary packages, as in the cases of Debian and Gentoo, are more at risk from the SHA-1 attack than keys used just to sign email messages.

There are more things to say about this, but since I’m no expert I don’t want to go on at length. There is much more to be said about the panacea of signatures, because, as appeared in my previous post about github, there are quite a few users who are confused about what git tag signatures should mean to Gentoo developers and users. But this is the kind of stuff I always wanted to write about and almost never had time for; I guess I’ll try my best to find time for it.

I still dislike github

When github was unveiled, I was a bit concerned by the idea that it seemed to foment forking software all over the place, instead of branching, with results that are, in my opinion, quite upsetting in the way some software is handled (see also this David Welton post, which is actually quite to the point – I don’t always bash Debian, you know, and at least with the Debian Ruby team I seem to often be on the same page). I was so concerned that I even wrote an article for LWN about forking and the problems that come with it.

Thinking about this, I should tell people to read that article when they talk about the eglibc mess. And when I can find the time I should translate my old article about MySQL from Italian to English; maybe I should also change the articles page to link the articles directly in HTML form rather than just PDF and DVI.

At any rate, the “fork it” button is not what I’m going to blog about today, but rather what happened yesterday when I decided to update hpricot, which is now hosted strictly on github. Indeed, there is no download page other than the one on github, which points to the tags of the git repository for downloads.

The idea seems to be getting increasingly common that just tagging a release is enough to have it downloaded: no testing, no packaging, nothing else. For Ruby stuff gems are prepared, but that’s it (and I think that github integrates enough logic to not even need that). It’s cool, isn’t it? No it’s not; not for distributions and not for security.

There is one very important feature for distributions in released code, and that is the verifiability of the release archives. While it might be a bit too much to ask for all upstream projects to have a GnuPG key and sign all their releases, at least making sure that a release tarball is always available, identical, to everybody who downloads it would be welcome. I’ll let you guess: github does not do that, which is giving me headaches since it means I have to create the tarballs manually and push them to the Gentoo mirrors for them to be available (git archive makes it not too difficult, but it’s still more difficult than just fetching the release upstream).

I wonder how it might be possible to explain to the Ruby community (because here it’s not just the Rails community, I’d say) that distributions are a key to proper management, and not something to hinder at every turn.

A “new” C library

Debian announced they are going to move away from the GNU C library toward eglibc, a derivative designed to work on embedded systems; not even a few hours after I shared the news on Google Reader myself, I was contacted regarding a Gentoo bug for it. Since I don’t like repeating myself too much, I’m just going to write here what I think.

First of all, the idea is interesting, especially for the embedded developer in me (who is still waiting for the time to go buy the pins to solder a serial port onto my WRT54GL), but also for the “alternative” developer in me. I have worked on Gentoo/FreeBSD and I always hoped to find a way to handle a uClibc chroot to test my own stuff. Testing eglibc is going to be interesting too (if only I had time to finish analysing the tinderbox logs, that is).

What I do find quite unfunny, and a bit discomforting, is that half the “features” that Aurélien Jarno lists boil down to “we don’t need to deal with Drepper”. Now, I agree that Ulrich is not the best person to deal with (although I’d sooner deal with him than with Ciaran, but that’s another problem *edit: since it wasn’t really a good example here, I wish to explain it; it is public knowledge that Ciaran and I don’t get along too well, and I don’t like his solutions or his methods; on the other hand, while I dislike Ulrich’s methods too, I have fewer problems with his solutions, and I never had a personal quarrel with him; there goes my “I’d sooner deal with him” comment*), and I also agree that his ideas of “good” are sometimes difficult to share (especially when it comes to the implementation of versioning for ELF symbols, which gave me such a headache to replicate in Ruby-Elf). On the other hand, I wonder how much of that choice is warranted.

What most people seem to compare this to is the move from XFree to Xorg, or the fork of cdrecord into cdrkit. I disagree with comparing this case to those two. Both of those were due to license issues, which, for people caring about the freedom of their software, are among the most important issues (unless, of course, you just don’t care and go forth with piracy — which is what actually brought me to a bit of a nervous point with ALTlinux in the past). While not having assholes around is probably just as important (and I’d point to this book which I remember Donnie describing before; unfortunately I haven’t had the pleasure to read it yet), I still don’t see this as the brightest move Debian could have made.

More to the point, the cdrkit fork doesn’t look like one of the shiniest things in Free Software; while cdrecord is no longer the massive single point of failure for CD/DVD burning on Linux, one has to note that this is also due to other projects, like libburn, having had an injection of development and a race toward feature support once the idea started flowing around that cdrecord wasn’t worth keeping. And the XFree to Xorg move was extremely helped (and made successful) by the fact that the developers of XFree itself moved out of the project toward Xorg.

I’m not criticising Debian’s move; I’m actually thrilled to see the results. I’m not criticising eglibc; I’m very interested in the project. I’m just trying to throw a bucket of water over the people who seem to be on fire about eglibc right now. I don’t see this as a huge paradigm change for now. Once eglibc has huge advantages (which for now I don’t see), we can probably get passionate about moving. Right now I don’t see this huge change; there are neat ideas, and certainly it’s a good thing not to have assholes blocking the project, but is that enough?

Now, more to the Gentoo side of things: I’m not part of the toolchain team, so I don’t know if they have any particular, special plan about this. I would expect them not to, for now at least; adding support for a new C library is not impossible, but it’s not easy either. I might be worrying about nothing, but I don’t trust the “100% compatibility” that Debian seems to promise between EGLIBC and GLIBC, even if it’s just bugfixes over a given GLIBC version (which brings me back to my hatred of forked projects); I wrote some pieces about the difficulty of ABI compatibility, and while I did also show how to keep backward compatibility, I said nothing about forward compatibility.

Also, I don’t trust Debian’s assurances that it works with all software, open source or not. The reason is not only to be found in Debian not being that reliable, but also in the fact that we’re not Debian: we have different patchsets, Debian tends not to send stuff upstream, and so on. We also leave the user with full access to tinker with features, which would mean being able to disable certain stuff in eglibc (I’d welcome that; there are a few things I’d like to disable on my server, for instance); in turn this means that either some USE flag configurations will become unsupported (or unavailable), or we’re going to need special dependencies to ensure that certain pieces of the C library are enabled for eglibc. Bottom line: we’re going to need a long test period either way (note that we have big problems even between minor bumps of the same C library – think glibc 2.8 or 2.9, or FreeBSD 6.3 – which means we’re going to need lots of testing to move to a new C library altogether).

Adding to this is the fact that a new C library will mean new profiles (new profiles to test, too), and new stages. Which means more work for everybody. That doesn’t mean it’s not going to happen, just that you can’t expect it to happen tomorrow.

Debian, Gentoo, FreeBSD, GNU/kFreeBSD

To shed some light and get around the confusion that seems to have taken hold of quite a few people who came to ask me what I think about Debian adding GNU/kFreeBSD to the main archive, I’d like to point out, once again, that Gentoo/FreeBSD has never been the same class of project as Debian’s GNU/kFreeBSD port. Interestingly enough, I already said this more than three years ago.

Debian’s GNU/kFreeBSD uses the FreeBSD kernel but keeps the GNU userland, which means the GNU C Library (glibc), the GNU utilities (coreutils) and so on and so forth; on the other hand, Gentoo/FreeBSD uses both the vanilla FreeBSD kernel and a mostly vanilla userland. By mostly I mean that some parts of the standard FreeBSD userland are replaced with either compatible, selectable or updated packages. For instance, instead of shipping sendmail or the ISC dhcp packages as part of the base system, Gentoo/FreeBSD leaves them to be installed as extra packages, just like you’d do with Gentoo. And you can choose whichever cron software you’d like instead of using the single default provided by the system.

But if a piece of software is designed to build on FreeBSD, it usually builds just as well on Gentoo/FreeBSD; rarely is there trouble, and most of the time the trouble is with different GCC versions. On the other hand, GNU/kFreeBSD requires most of the system-dependent code to be ported; xine, for instance, has already undergone this at least a couple of times.

I sincerely am glad to see that Debian finally came to the point of accepting GNU/kFreeBSD into main; on the other hand, I have no big interest in it besides as a proof of concept. There are things that are not currently supported by glibc even on Linux, like SCTP, which on FreeBSD are provided by the standard C library; I’m not sure whether they are going to port the Linux SCTP library to kFreeBSD or whether they decided to implement the interface inside glibc. If the latter is the case, though, I’d be glad, because it would finally mean that the code wouldn’t be left to go stale.

So please, don’t mix up Gentoo/FreeBSD with Debian’s GNU/kFreeBSD. And don’t even try to call it Gentoo GNU/FreeBSD like the Wikipedia people tried to do.

The UTF-8 security challenge

I make no mystery of the fact that I like my surname to be spelt correctly, even if that’s internationally difficult. I don’t think that’s too much to ask, sincerely; if you want to refer to me and you don’t know how to spell my surname, you have a few other options, starting with my nickname (“Flameeyes”), which I keep using everywhere, including the domain of this blog, because, well, it’s a name as good as my “real” name. While I know other developers, starting with Donnie, prefer to be recognized mainly by their real name, I know my name is difficult to type for most English speakers, so I don’t usually ask that much; Flameeyes was, after all, more unique to me than “Diego Pettenò”, since there are three other people with the latter name just in my city.

But even without going with nicknames, which might not sound “professional”, I’m fine with being called Diego (in Gentoo I’m the only one; as far as multimedia areas are concerned, I’m Diego #2, since “the other Diego”, Biurrun, takes due priority), or, since a few months ago, Diego Elio (I don’t pretend to be unique in the world, but when I chose my new name, besides choosing my grandfather’s name, I also checked I wouldn’t step into the shoes of another developer), or, if you really really need to type my name in full, “Diego Petteno`” (yes, there is an ASCII character to represent my accent, and it’s not the usual single quotation mark; even the quotation mark, though, works as a tentative spelling, like for banks and credit cards). If you’re in a particularly good mood and want to tease me a bit, you could also use 炎目 (which is probably too literal a translation of “Flameeyes” into kanji); I think the only person ever to use that to call me has been Chris (White), and it also does not solve the issue of UTF-8.

Turns out it’s not that easy at all. I probably went a little overboard the other day about one GLSA mistyping my name (it still does), because our security guys are innocent on the matter: glsa-check breaks with UTF-8 in the GLSA XML files, which makes it hard to type my name. That is a bug in glsa-check, since you should not assume anything about the encoding of XML files: each file declares its own encoding! The reason why I was surprised (and somewhat annoyed) is that I was expecting it to be typed right for once: py handled it, and I’m sure he has the ò character on his keyboard.
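I haven’t dug into glsa-check’s code, but the general point about XML encodings is easy to demonstrate, for instance with Python’s ElementTree (the element names below are hypothetical, chosen just for this sketch, not taken from the real GLSA schema): as long as the declaration matches the bytes, the same ò comes out of the parser whether the file is stored as UTF-8 or ISO-8859-1.

```python
# Sketch: an XML parser must honor the file's own encoding declaration,
# so an accented character round-trips regardless of which encoding the
# file happens to use. (Element names here are made up for the example.)
import xml.etree.ElementTree as ET

name = "Diego Pettenò"
for enc in ("utf-8", "iso-8859-1"):
    doc = ('<?xml version="1.0" encoding="%s"?>'
           '<glsa><submitter>%s</submitter></glsa>') % (enc, name)
    # Feeding bytes (not str) lets the parser read the declaration itself.
    root = ET.fromstring(doc.encode(enc))
    assert root.findtext("submitter") == name
```

A consumer that instead reads the file as raw bytes, or assumes a fixed encoding, is exactly the kind of code that mangles my surname.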

Curious about this, I also wanted to confirm how the other distributions would handle my name. A good chance to do that was provided by CVE-2008-4316 (which I discussed briefly already). The results are funny, disappointing and interesting at the same time.

The oCERT advisory has a broken encoding and shows the “unknown character” symbol (�); on the other hand, Will’s mail at SecurityFocus shows my name properly. Debian cuts my surname short, while Ubuntu simply mistypes it; Red Hat, on the other hand, shows it properly; score one for Red Hat.

One out of four distributions handles my name correctly (Gentoo has no GLSA on the matter, but I know what would have happened; nor does the CVE link to other distributions, just a few more security-focused sites I’m not interested in at the moment); that’s not really good. Especially, I’m surprised that the one distribution getting it right is Red Hat, since the other two are the ones I usually see named when people talk about localising Free Software packages. Gentoo at least does not pretend to be ready for internationalisation in the first place (although we have a GLEP that does).

Okay, I certainly am a nit-picker, but it’s 2009; there are good ways to handle UTF-8, and the only obstacles I see nowadays are very old legacy software and English speakers who maintain that seven bits are enough to encode the world, which is not true by a long shot.
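For the record, here is the seven-bit problem in miniature (Python used purely as an illustration):

```python
# The ò in my surname (U+00F2) needs the eighth bit: UTF-8 spends two
# bytes on it, while a pure ASCII pipeline can only mangle it.
surname = "Pettenò"
utf8 = surname.encode("utf-8")
assert len(surname) == 7 and len(utf8) == 8    # seven letters, eight bytes
assert utf8[-2:] == b"\xc3\xb2"                # the UTF-8 sequence for ò
assert surname.encode("ascii", "replace") == b"Petten?"
```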

A couple of thoughts about package splitting

In my post regarding remote debugging (which I promised to finish with a second part; I just didn’t have time to test a couple of things), I suggested that I’d like to have some kind of package splitting in Portage, to create multiple binary packages out of a single source package and ebuild, similarly to what distributions based on RPM or deb do (let’s call them Red Hat and Debian, for historical reasons).

Now, I want to make sure nobody misunderstands me: I don’t intend to propose this as a way of removing the fine-grained control USE flags give us; I sincerely love that; and I also love not having to worry about installing -dev and -devel packages on my machines to be able to build software, even outside of the package manager’s control. I really find these two are strengths of Gentoo, rather than weaknesses, so I have no intention of fiddling with them. On the other hand, I think there are enough use cases to justify even finer control at the binpkg level.

I’ve already given a scenario in my post about remote server debugging, but let me try to show something different, something I’ve actually been thinking about myself. Yes, I know this is a very vested interest for me, but I also think this is what makes Free Software great most of the time: we’re not solutions looking for problems, but usually solutions to a problem someone had at least at one point in time. Just like my writing support for NFS export on the HFS+ filesystem in Linux.

So let me introduce the scenario I’ve been thinking about. As it happens, I tend to a number of boxes in several offices for friends and friends of friends in my spare time, on the side. It’s not too bad; it does not pay my bills, but it does pay for some side things, which is good. Now, since these offices usually run Windows, even though I obviously install Firefox as the second step after doing the system updates, it’s not unlikely that every other time I go there I have to clean up the systems. There are computers I’ve wiped and reinstalled a few times already. I’ve now been thinking about setting up some firewalls based on Snort or similar. Since I am who I am, these would end up being Gentoo-based (as a side note, I’m tempted to set one up here too, so I can finally stop having trouble with Vista-based laptops that mess up my network). Oh, and please: I know it might sound very stupid considering there are good solutions for this already, but considering how much I’m paid and the amount of money they are ready to spend (read: near to none), I would find it nicer to be paid to work on some Gentoo-related stuff than to be paid to just look up and learn how to use already-made equipment. Of course, if you have suggestions, they are welcome anyway.

So anyway, in this situation I’d have to set up boxes that would usually feel very embedded-like: a common basis, the minimum maintenance possible, upgrades when needed. Donnie’s idea of using remote package fetching and instant deletion is not that good for this, because it still requires a huge pipe to shove the data around; not only do I not have so much upload bandwidth to employ for binpkging a whole system with debug information, it would also be a hit on my users’ bandwidth that most of them wouldn’t like (whether they want to use BitTorrent or look up p0rn from the office is not my problem).

With this in mind, I’d sincerely find it much nicer to be able to split packages, Portage-side, into multiple binary packages that can be fetched, synced, or whatever else, independently, as needed. As I proposed, a binpkg for the debug information files, but also a binpkg for documentation (including man and info pages), one for development data (headers, pkg-config files), and maybe one for the prepared sources, which I want to talk about in a moment. With an environment variable it shouldn’t be much of a problem to choose which of these split binary packages to install in the system without explicit request, with a default including all of them but the debug information and the sources. This would also replace the INSTALL_MASK approach as well as the noinfo, noman and nodoc FEATURES. It wouldn’t be a logical split of a package into multiple entries in the system, but rather a way to choose which parts to install, complementary to USE flags.
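To make the idea a bit more concrete, here is a rough sketch, in Python, of how an installed image’s file list could be partitioned into such subpackages by path. The category names and glob patterns are entirely my own invention for this example; nothing like this exists in Portage today.

```python
# Sketch: partition an image's file list into hypothetical split
# binary packages by path. Patterns and category names are invented.
import fnmatch

SPLITS = {
    "debug": ["/usr/lib/debug/*"],
    "doc":   ["/usr/share/doc/*", "/usr/share/man/*", "/usr/share/info/*"],
    "dev":   ["/usr/include/*", "/usr/lib/pkgconfig/*", "*.a"],
}

def split_image(paths):
    """Assign each installed path to a subpackage; 'main' is the default."""
    result = {name: [] for name in SPLITS}
    result["main"] = []
    for path in paths:
        for name, patterns in SPLITS.items():
            if any(fnmatch.fnmatch(path, pat) for pat in patterns):
                result[name].append(path)
                break
        else:
            result["main"].append(path)
    return result

image = [
    "/usr/bin/foo",
    "/usr/include/foo.h",
    "/usr/share/man/man1/foo.1",
    "/usr/lib/debug/usr/bin/foo.debug",
]
packages = split_image(image)
```

An environment variable could then simply list which of these keys to unpack at install time, which is what would subsume INSTALL_MASK and the noman/noinfo/nodoc FEATURES.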

As for packaging the sources, as I said above, there are two interesting points to be made for that, or maybe three. The first is that when you have to distribute a system based on Gentoo, you cannot just provide the binaries; since many packages are released under the GNU GPL version 2, even if you didn’t change the sources at all, you should be distributing them alongside the binaries; and we modify a lot of sources. For license compliance we should also provide the full set of sources from which the code is derived. This is especially tricky for embedded systems. By packaging up the sources used for the builds, embedded distributors would be able to just provide all the -src subpackages as the full sources for the system.

The second point is that you can use the source packages for debugging too. Since there is, as far as I know, no way to fully embed the source code of software in the debug sections of the files generated from it, the only way for GDB to display source code lines during debugging is to have the source files used for the build available during the debugging session. This can easily be done by packaging up the sources and installing them in, say, /usr/src/portage/ when they are needed, from a subpackage.

A final point would be that by packaging sources in sub-packages, and distributing them, we could reduce the overhead for users of unpacking (maybe with uncommon package formats) and preparing sources (maybe with lots of patches and autotools rebuilding). Let’s say that every six hours a server produces md5-based source subpackages for all the ebuilds in the tree, or a subset of them. Users would then use those sources primarily, while still having the ebuilds provide all the data and workflow, so that the original untouched source would be enough to compile the package. Of course, this would then require us to express dependencies on a per-phase basis, since autotools wouldn’t be required at build time at all.

Okay, I guess I’m really dreaming lately, but I think that throwing around some ideas is still better than not doing so; they can always be picked up and worked on. Sometimes it has worked.