User-Agent strings and entropy

It was 2008 when I first got the idea to filter User-Agents as an antispam measure. The filter worked on its own for quite a while, but recently I had to add more sophisticated fingerprinting to my ruleset to catch spammers. It still works better than a captcha, but it has worsened a bit.

One of the reasons why the User-Agent itself is not enough anymore is that my filtering now runs against a more important project. EFF's Panopticlick has shown that the uniqueness of User-Agent strings is actually an easy way to track a specific user across requests. This became so important that Mozilla standardized their User-Agents starting with Firefox 4, to reduce their size and thus their entropy. Among other things, the "trail" component has been fixed to 20100101 on the desktop, and to the same version as Firefox itself on mobile.
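
For reference, this is roughly what the standardized desktop string looks like in Firefox 4 (the rv: and Firefox versions change between releases; the 20100101 trail does not):

Mozilla/5.0 (X11; Linux x86_64; rv:2.0) Gecko/20100101 Firefox/4.0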

_Unfortunately, Mozilla lies on that page. Not only is the trail not fixed for Firefox Aurora (i.e. the alpha version), which means that my first set of rules was refusing access to all the users of that version, but their own Lightning extension for SeaMonkey also appends to the User-Agent, when they said that doing so wasn't supported anymore._

A number of spambots seem to get this wrong, by the way. My guess is that they have some code that generates the User-Agent by stitching together a bunch of fragments, randomizing the result, so you can't just kick a particular agent. Damn smart, if you ask me, and unfortunately effective: ModSecurity keys its IP collection on the remote address and the user-agent together, so if a bot cycles through different user agents, it's harder for ModSecurity to understand that it's actually the same IP address.
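
To give an idea of what I mean, the collection is initialised with a rule along these lines (a sketch in the spirit of the ModSecurity Core Rules, not my actual ruleset):

SecAction "phase:1,pass,nolog,initcol:ip=%{REMOTE_ADDR}_%{REQUEST_HEADERS.User-Agent}"

Two requests from the same address with different User-Agents end up in two different collections, so the per-IP counters never add up.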

I do have some reservations on Mozilla's handling of the identification of extensions. First they say that extensions and plugins should not edit the agent string anymore – but Lightning does! – then they suggest that they can instead send an extra header to identify themselves. But that just means that fingerprinting systems only need to start counting those headers as well as the generic ones that Panopticlick already considers.

On the other hand, other browsers don't seem to have gotten the memo yet — indeed, both Safari's and Chrome's strings are long and include a bunch of almost-independent version numbers (AppleWebKit, Chrome, Safari — and Mobile on the iOS versions). It gets worse on Android, as both the standard browser and Chrome provide a full build identifier, which is not only different from one device to the next, but also from one firmware to the next. Given that each mobile provider has its own builds, I would be very surprised if I could find two of my friends with the same identifier in their browsers. Firefox is a bit better on that, but it sucks in other ways, so I'm not using it as my main browser there anymore.
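
To make the entropy problem concrete, this is the shape of a stock Android browser string; the device and build values here are examples of the format, and the Build/ identifier is the part that changes with every device and firmware:

Mozilla/5.0 (Linux; U; Android 2.3.4; en-gb; Nexus One Build/GRJ22) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1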

HTML5: compliance shouldn’t require support

Seems like the whole thing about HTML5 and video/audio formats is not done yet, three years after my cursing at Quassel, when Qt-WebKit decided to bring in GStreamer to support HTML5 video.

This time the issue is with both Firefox and Thunderbird, both of which come with a webm USE flag that, if disabled, makes them fail to build.

I'm starting to wonder why people insist that HTML5 compliance requires you to support viewing the video. All you have to do is parse the element and act on it; showing a "This content is not available with your current browser" message is quite fine, if I don't want WebM support!

No technical content for today; it's Sunday and I'm fighting to get Thunderbird to work.

Don’t try being smarter than me — you aren’t, you’re just a website!

So today the new major version of Firefox was released. I only care relatively, since I'm actually a Chrome user — and the one killer feature Firefox had over Chrome is now gone (the Delicious extension doesn't work with Firefox 4 yet). But I have to test Firefox, and today I was explicitly working on my rule set, so I was interested in getting the new version of Firefox on all of my systems. Also, my mother uses both Chrome and Firefox (don't ask), and thus I had to download a copy for her as well.

This would usually not be a problem: you go to the website and download the copy you need to run on your system; even better, it provides a direct link for the package you need on the system you're visiting from. But I'm not your average user: I wanted to download the Windows and OS X versions of Firefox from my main workstation, running Linux. I don't need the Linux version here for two reasons: I don't use Firefox here at all, and if I did, I would let Portage manage it.

For most download pages, there is a simple way to deal with that as well: a link named "Other operating systems and languages" that gives you access to the other binary packages — I actually like LibreOffice's choice: it loads all the platforms and languages in selection boxes, selects by default the one you're visiting from, but easily lets you choose the others for the same language. If you visit the English/American Firefox homepage, there is indeed such a link.

But since my locale is set to Italian on this computer, whenever I visit either getfirefox.com or mozilla.com I hit a captive redirect that brings me to the Italian version of the website. This is meant to be friendly to users, I'm sure, but there are two issues:

  • the link to other operating systems and languages is gone, replaced instead by the release notes; Jo (directhex) noticed that the same is true for French and German sites as well;
  • there is no way to navigate from the Italian website to the English one or the other way around: this makes it impossible to reach the page above.

And this is not the first time I've found something like this, but finding it on a package such as Firefox really stressed me out.

In all fairness

I know that Apple got a lot of hate from Free Software developers (and not only them) for the way they handle their App Store, mostly regarding the difficulty of actually getting applications approved. I sincerely have no direct experience with it, but if I apply what I learnt from Gentoo, the time they take to approve applications sounds about right for a thorough verification.

Google, on the other hand, was said to take much less time, but from personal experience searching for content on the Android Market, I find DVD Jon's post quite on point. There are a number of applications that are on the verge of fraud, if not outright fraudulent, that got approved easily.

On the other hand, as soon as Google was found to have added to the Froyo terms of service a clause reserving the option of remotely killing an application, tons of users cried foul. Just like they did for Apple, which also has the same capability and has been exercising it for applications that were later found not to comply with their terms of service.

*A note here: you might not like the way Apple insists on telling you what you should or should not use. I understand it pretty well, and that’s one of the reasons why I don’t use an iPhone. On the other hand, I don’t think you can say that Apple is doing something evil by doing so. Their platform, their choice; get a different platform for a different choice.*

So there are a number of people who think that Apple's policy of reviewing applications is evil (while Google's allowing possible frauds is a-ok), and that, in both cases, the remote killswitch is something nasty and a way for them to censor content for whatever evil plan they have. That paints both of them in a bad light, doesn't it? But Mozilla should be fine, shouldn't it?

I was sincerely wondering what those people who always find a way to despise "big companies" like Apple and Google at the same time, asking their users to choose "freer" alternatives (oftentimes with worse problems), would think while I was reading Netcraft's report of the malware add-on found on the Mozilla index.

I quote: "Mozilla will be automatically disabling the add-on for anyone who has downloaded and installed it." So Mozilla has a remote killswitch for extensions? How else would they be achieving this?

And again: “[Mozilla] are currently working on a new security model that will require all add-ons to be code-reviewed before becoming discoverable on addons.mozilla.org.” Which means they are going to do the same thing that Apple and Google already do (we’ll have to wait and see to find out to which degree).

Before people misunderstand me: I have nothing against Mozilla and I think they are on the right track here. I would actually hope for Google to tighten their approval process, even if that means a much longer turnaround before new applications become available. As a user, I'd find it much more reassuring than what we have right now (why do half the demo/free versions of various apps want to access my personal data, hmm?).

What I'm trying to say here is that we should really stop crying foul at every choice that Apple (or Microsoft, or Sony, or whoever) makes; they might have quite good reasons for it, and we might end up following in their steps (like Mozilla appears to be about to do).

Bundling libraries: the curse of the ancients

I was very upset by one comment from Ardour's lead developer Paul Davis in a recently reported "bug" about the un-bundling of libraries from Ardour in Gentoo. I was, to be honest, angry after reading his comment, and I was tempted to answer badly for a while; but then I decided my health was more important and backed away, thought about it, and then answered the way I answered (which I hope is diplomatic enough). Then I thought it might be useful to address the problem in a less concise way and explain the details.

Ardour bundles a series of libraries; as I wrote previously, there are problems related to this, and we dealt with them by simply unbundling the libraries. Now Ardour is threatening to withdraw support from Gentoo as a whole if we don't back away from that decision. I'll try to address his comments in multiple parts, so that you can understand why they really upset me.

First problem: the oogie-boogie crashes

It's a quotation from Adam Savage of MythBusters; watch the show if you want to know the details. I learnt about it from Irregular Webcomic years ago, but I only got to see the show about six months ago, since in Italy it only airs on satellite pay TV, and the DVDs are not available (which is why they are in my wishlist).

Let’s see what exactly Paul said:

Many years ago (even before Gentoo existed, I think) we used to distribute Ardour without the various C++ libraries that are now included, and we wasted a ton of time tracking down wierd GUI behaviour, odd stack tracks and many other bizarre bugs that eventually were traced back to incompatibilities between the way the library/libraries had been compiled and the way Ardour was compiled.

I think I have now coined a term for my own dictionary, and will call this the syndrome of oogie-boogie bugs, for each time I hear (or find myself muttering!) "we know of past bad behaviour". Sorry, but without documentation these things are like unprovable myths, just like the one Adam commented upon (the "pyramid power"). I'm not saying that these things didn't happen; far from it, I'm sure they did. The problem is that they are not documented and thus are unprovable, and impossible to dissect and correct.

Also, I'm not blaming Paul or the Ardour team for being superficial, because, believe it or not, I suffer(ed, hopefully) from that syndrome myself: some time ago, I reported to Mart that I had maintainer-mode-induced rebuilds on packages that patched both Makefile.am and Makefile.in, and that thus the method of patching both was not working; while I still maintain that it's more consistent to always rebuild autotools (and I know I have yet to write about why that is), Mart pushed me into proving it, and together we were able to identify the problem: I was using XFS for my build directory, which has sub-second mtime precision, while he was using ext3, with mtime precise only to the second, so I was experiencing difficulties he would never have been able to reproduce on his setup.

Just to show that this goes beyond this kind of problem: since I joined Gentoo, Luca has told me to be wary about suggesting the use of -O0 when debugging, because it can cause stuff to miscompile. I never took his word for it, because that's just how I am, and he didn't have any specifics to prove it. Turns out he wasn't that wrong after all, since if you build FFmpeg with -O0 and Sun's compiler, it cannot complete the link. The reason is that with older GCC, Sun's compiler, and others I'm sure, -O0 turns off the DCE (Dead Code Elimination) pass entirely, and causes calls guarded by branches like if (0) to be compiled anyway. FFmpeg relies on the DCE pass always happening. (There is more to say about relying on the DCE pass, but that's another topic altogether.)
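
If you want to see the pattern for yourself, here is a little experiment that reproduces it; the file and function names are made up, but the mechanism is the same one FFmpeg relies on:

flame@yamato ~ % cat > dce-test.c <<'EOF'
/* declared but never defined anywhere: the program links
 * only if the compiler eliminates the dead call below */
extern void never_defined(void);

int main(void)
{
    if (0)
        never_defined();
    return 0;
}
EOF
flame@yamato ~ % gcc -O2 dce-test.c   # links fine, the dead call is eliminated
flame@yamato ~ % gcc -O0 dce-test.c   # fails to link on compilers that skip DCE at -O0

Recent GCC folds the if (0) away even at -O0, so to actually see the undefined reference to never_defined you may need one of the compilers mentioned above.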

So again, if you want to solve bugs of this kind, you have to do just like the actual MythBusters: document, reproduce, dissect, fix (or document why you have to do something, rather than just saying you have to do it). Not having the specifics of the problem makes it an "oogie-boogie" bug, and it's impossible to deal with it.

Second problem: once upon a time

Let me repeat one particular piece of the previous quote from Paul Davis (emphasis mine): "Many years ago (even before Gentoo existed, I think)". How many years ago is that? Well, since I don't want to track down the data on our own site (I have to admit I find it appalling that we don't have a "History" page), I'll go around quoting Wikipedia. If we talk about Gentoo Linux under this very name, version 1.0 was released on March 31, 2002 (hey, that's almost seven years ago by now). If we talk about Daniel's project, Enoch Linux 0.75 was released in December 1999, which is more than nine years ago. I cannot seem to confirm Paul's memories, since their Subversion repository seems to have discarded the history information from when they were in CVS (it reports the first commit in 2005, which is certainly wrong if we consider that Wikipedia puts their "Initial Release" in 2004).

Is anything the same as it was at that time? Well, most likely there are still pieces of code that are older than that, but I don't think any of them are in actual use nowadays. There have been, in particular, a lot of transitions since then. Are difficulties found at that time of any relevance nowadays? I sincerely don't think so. Paul also doesn't seem to have any documentation of newer occurrences of this, and just says that they don't want to spend more time debugging these problems:

We simply cannot afford the time it takes to get into debugging problems with Gentoo users only to realize that its just another variation on the SYSLIBS=1 problem.

I'll go around that statement in more detail in the next problem, but for now let's accept that there has been no documentation of new cases, and that all we have to go on here is old history. Let's try to think about what that bad history was. We're speaking about libraries, first of all; what does that tell us? If you're an avid reader of my blog, you might remember what actually brought me to investigate bundled libraries in the first place: symbol collisions! Indeed this is very likely: if you remember, I did find one crash in xine due to the use of system FFmpeg, caused by symbol collisions. So it's certainly not a far-fetched problem.

The Unix flat namespace for symbols is certainly one big issue that projects depending on many libraries have to deal with, and I admit there aren't many tools that can deal with it. While my collision analysis work has focused up to now on identifying the areas of the problem, it only helps, in the big scheme of things, to find possible candidates for collision problems. This actually made me think that I should adapt my technique to identify problems on a much smaller scale, taking one executable as input and identifying duplicated symbols. I just added this to my TODO map.
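
A first, naive version could be as simple as this sketch; it ignores weak symbols and symbol versioning, so expect false positives, but it shows the idea:

#!/bin/sh
# list the dynamic symbols defined by each library the given
# executable loads, then print the names defined more than once
for lib in $(ldd "$1" | awk '/=> \// { print $3 }'); do
    nm -D --defined-only "$lib" | awk '{ print $NF }'
done | sort | uniq -d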

Anyway, thinking about the amount of time that has passed since Gentoo's creation (and thus since what Paul thinks is when the problems started to happen), we can see that there is at least one big "event horizon" in GNU/Linux since then (and for once I use this term because it's proper to use it here): the libc5 to libc6 migration; the HOWTO I'm linking, from Debian, was last edited in 1997, which puts it well within the timeframe that Paul described.

So it's well possible that people at the time went and used libraries built for one C library with an Ardour built against a different one, which would have created, almost certainly, subtle issues that are difficult to identify (for a person not skilled with linkers, at least). And it's certainly not the only possible cause of similar crashes, or, even worse, unexpected behaviour. If we look again at Paul's comment, he speaks of "C++ libraries"; I know that Ardour is written in C++, and I think I remember some of the bundled libraries being written in C++ too. I'm not sure if he's right in calling all of them "C++ libraries" (C and C++ are two different languages, even if the foreign calling convention glue is embedded in the latter), but if even a single one of them is, it can open a different Pandora's box.

See, if you look at GCC's history, it wasn't long before the Enoch 0.75 release that a huge paradigm shift hit Free Software compilers. The GNU C Compiler, nowadays the GNU Compiler Collection, was forked into the Experimental/Enhanced GNU Compiler System (EGCS) in 1997, which was merged back into GCC with the historical release 2.95 in April 1999. EGCS contained a huge amount of changes, a lot of them related to C++. But even that wasn't near perfection; for many, C++ support was ready for prime time only after release 3 at least, so there were wild changes going on at that time. Libraries built with different versions of the compiler might well have mangled the same symbol names in wildly different ways and, even worse, would have been using different STL libraries. Add to the mix the infamous 2.96 release of GCC as shipped by RedHat, I think the worst faux-pas in the history of RedHat itself, with so many bugs due to backporting that a project I was working with at the time (NoX-Wizard) officially unsupported it, suggesting the use of either 2.95 or 3.1. We even had an explicit #error if the 2.96 release was used!

A smaller-scale paradigm shift happened with the release of GCC 3.4 and the change from libstdc++.so.5 to libstdc++.so.6, which is what we use nowadays. Mixing libraries using the two ABIs and the two STL versions caused obvious and non-obvious crashes; we still have software using the older ABI, and that's why we have libstdc++-v3 around; Mozilla, Sun and Blackdown hackers certainly remember that time, because it was a huge mess for them. It's a very common argument (and one of my favourites) against the use of C++ for mission-critical system software.

Also, GCC's backward compatibility is near non-existent: if you build something with GCC 4.3 without using static libraries, executing it on a system with GCC 4.2 will likely cause a huge amount of problems (forward compatibility is always ensured, though). Which adds up to the picture I already painted. And do we want to talk about the visibility problem? (On a different note, I should ask Steve for a dump of my old blog to merge here; it's boring not remembering that a given post was written on the old one.)

I am thus not doubting Paul's memories at all regarding problems with system libraries and so on and so forth. I would also stress another piece of his comment: "eventually were traced back to incompatibilities between the way the library/libraries had been compiled and the way Ardour was compiled". I understand he might not be referring just to the compiler (and compiler version) used in the build, so I wish to point out two particular GCC options: -f(no-)exceptions and -f(no-)rtti.

These two options enable or disable two C++ language features: exception handling and run-time type information. I can't find any reference to it in the current man page, but I remember it warning that mixing code built with and without them in the same software unit was bad. I wouldn't expect it to be any different now, sincerely. In general the problem is solved because each piece of software builds its own independent unit, in the form of an executable or shared object, and the boundary between those is subject to the contract that we call the ABI. Shared libraries built with and without those options are supposed to work fine together (I'm sincerely not ready to bet on it, though), but if the lower-level object files are mixed together, bad things may happen, and since we're talking about computers, they will, at the moment you least want them to. It's important to note here, for all the developers not expert with linkers, that static libraries (or more properly, static archives) are just a bunch of object files glued together, so linking something statically still means linking lower-level object files together.
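
You can convince yourself of that last point with any static archive lying around, since ar t simply lists the object files glued inside (zlib is just a handy example here):

flame@yamato ~ % ar t /usr/lib/libz.a
adler32.o
compress.o
crc32.o
deflate.o
[…]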

So the relevance of Paul's memories is, in my opinion, pretty low. Sure, shit happened, and we can't swear that it'll never happen again (most likely it will), but we can deal with it, which brings me to the next problem:

Third problem: the knee-jerk reaction

Each time some bug happens that is difficult to pin down, it seems like any developer tries to shift the blame. Upstream. Downstream. Sidestream. Ad infinitum. As a spontaneous reflex.

This happens pretty often with distributions, especially with Gentoo, which gives users "too much" freedom with their software, but it most likely happens in general, and I think it is the most frequent reason for bundling libraries. By using system libraries, developers lose what they think is "control" over their software, which in my opinion is often just sheer luck. Sometimes developers admit that their reason is just the desire to spend as little time as possible working on issues; some other times they try to explicitly move the blame onto the distributions or other projects, but at the end of the day the problem is just the same.

Free software is a moving target: you might develop software against one version of a library, not touch the code for a few months while it works great, and then a new version is released and your software stops working. And you blame the new release. You might be right (a new bug was introduced), or you might be wrong (you breached the "contract" called the API: some change happened, something that was not guaranteed to work in any particular way changed the way it worked, and you relied on the old behaviour). In either case, the answer "I don't give a damn, just use the old version" is a sign of something pretty wrong with your approach.

The Free Software spirit should be the spirit of collaboration. If a new release of a given dependency breaks your software, you should probably just contact the author and try to work out between the two projects what the problem is; if it's a newly introduced bug, make sure there is a testsuite, and that the testsuite includes a testcase for the particular issue you found. Writing testcases for bugs that happened in the past is exactly why testsuites are so useful. If the problem is that you relied on a behaviour that has changed, the author might know how not to rely on it and still have code that works as expected, or might take steps to make sure nobody else tries that (either by improving the documentation or by changing the interface so that the behaviour is not exposed). Bundling the dependency, citing multiple problems and giving no option, is usually not the brightest step.

I'm all for giving working software to users by default, so I can understand bundling the library by default; I just think that either it should be documented why that's the case, or there should be a way of not using it. Someone somewhere might actually be able to find out what the problem is. Just give them a chance. In my previous encounter with Firefox's SQLite, I received a mail from Benjamin Smedberg:

Mozilla requires a very specific version of sqlite that has specific compiled settings. We know that our builds don’t work with later or earlier versions, based on testing. This is why we don’t build against system libsqlite by design.

They know, based on testing, that they can't work with anything else. What that testing consists of, I still don't know. Benjamin admitted he didn't have the specifics, and referred me to Shawn Wilsher, who supposedly had more details, but he never got back to me with them. Which is quite sad, since I was eager to find out what the problem was, because SQLite is one of the most frequent sources of oogie-boogie bugs. I even noted before that the problem with SQLite seems to lie upstream, and I still maintain that in this case; while I said before that bundling is a knee-jerk reaction, I have also witnessed more than a few projects having problems with SQLite, and I myself had my share of headaches because of it. But this should really start to make us think that maybe, just maybe, SQLite needs help.

But we're not talking about SQLite here, and trust me that most upstreams will likely help you forward-port your code, fix issues and so on and so forth. Even if you, for some reason I don't want to talk about now, decided to change the upstream library after bundling it, oftentimes you can get it back to a vanilla state by pushing your changes upstream. I know it's feasible even with the most difficult upstreams, because I have done just that with FFmpeg, with respect to xine's copy.

But just so that we're clear, it does not stop with libraries; the knee-jerk reaction happens with CFLAGS too. If you have many users reporting that wild CFLAGS break your software, the most common reaction is to just disallow custom CFLAGS, while the reasoned approach would be to add a warning and then start identifying the culprit; it might be your code assuming something that is not always true, or it might be a compiler bug, and in either case the solution is to fix the culprit instead of preventing anybody from making use of custom flags.

Solution: everybody’s share

So far I have dissected Paul's comment into three main problems; I could probably write more about each of them, and I might if the points are not clear, but the post is already long enough (and I didn't want to split it up because it would have taken too long to become available), and I wanted to reach a conclusion with a solution, which is what I already posted in my reply to the bug.

The solution to this problem is to give everybody their share of the work. Instead of "blacklisting Gentoo" like Paul proposed, they should just do the right thing and leave us to deal with the problems caused by our choices and our needs. I have already pointed out some of these in my three-part article for LWN (part 1, part 2 and part 3). This means that if you get a user reporting some weird behaviour while using the Gentoo ebuild, your answer should not be "Die!" but "You should report that to the Gentoo folks over at their bugzilla". Yes, I know it is a much longer phrase and that it requires much more typing, but it's much more user-friendly and actually provides us all with a way to improve the situation.

Or you could also do the humble thing and ask for help. I already said that before, but if you have a problem with anything I have written about, and have good documentation of what the problem is, you can write to me. Of course I don't always have time to fix your issues (sometimes I don't even have time to look at them in a timely fashion, I'm afraid), but I have never sent someone away because I didn't like them. The problem is that most of the time I'm not asked at all.

Even if you might end up asking me some question that would sound very silly if you knew the topic, I'm not offended by those; just as I'd rather not be asked to learn all the theory behind psychoacoustics to find out why libfaad makes my music shriek, I don't expect Paul to know all the ins and outs of linking problems to find out why the system libraries cause problems. I (and others like me) have the expertise to identify a collision problem relatively quickly; I should also be able to provide tools to identify it more quickly still. But if I don't know about the problem, I cannot magically fix it; well, not always at least.

So Paul, this is an official offer; if you can give me the details of even a single crash or misbehaviour due to the use of system libraries, I’d be happy to look into it.

Is Firefox really that bad?

When I read some rants about Firefox, I thought they were a little bit too much. Now I start to wonder if they were quite on point instead. But before I start, I have to say that I haven't tried contacting anybody yet, neither the Gentoo Mozilla team nor upstream. And I'm sure the Gentoo Mozilla team are doing their best to provide a working Firefox while still following upstream guidelines on trademarks.

This actually sprouted from my previous work inspecting library paths. I went to check which libraries firefox-bin loaded from the system library directory, and noticed one curious thing: /usr/lib/libsqlite3.so was being loaded. What's the problem? The problem is that I knew that xulrunner (at least built from sources) bundles its own copy of SQLite3, so I wondered if they used the system copy for the binary package. Funnily enough, they really don't:

yamato link-collisions # ldd /opt/firefox/firefox-bin | grep sqlite3
    libsqlite3.so => /opt/firefox/libsqlite3.so (0xf67e7000)
    libsqlite3.so.0 => /usr/lib/libsqlite3.so.0 (0xf621e000)
yamato link-collisions # lddtree.sh /opt/firefox/firefox-bin | grep sqlite3 -B1
    libxul.so => /opt/firefox/libxul.so
        libsqlite3.so => /opt/firefox/libsqlite3.so
--
        libsoftokn3.so => /usr/lib/nss/libsoftokn3.so
            libsqlite3.so.0 => /usr/lib/libsqlite3.so.0

(The lddtree.sh script comes from pax-utils and uses scanelf. I have a similar script in my Ruby-Elf suite, implemented as a testcase; it produces basically the same results.)

So the binary version of the package uses the system copy of NSS and thus loads the system copy of SQLite3. I haven't gone as far as checking where the symbols get resolved, but one of the two copies is going to be loaded and unused, wasting memory (clean and dirty, for relocated data sections). Not nice, but one can say it's the default binary, and it has to adapt to unknown systems. In truth the problem here is that upstream didn't use an rpath, and thus the firefox-bin program does not load all its libraries from the /opt/firefox directory (since the /usr/lib/nss directory comes first). Had they built their binary with the rpath set to $ORIGIN, it would have loaded everything from /opt/firefox without caring about the system libraries, as it was intended to do. Interestingly enough, they do just that for Solaris, but not for Linux, where they prefer fiddling with LD_LIBRARY_PATH.
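
For the record, that is a single linker flag at build time; this is a sketch of the idea, not Mozilla's actual build line:

gcc … -Wl,-rpath,'$ORIGIN' -o firefox-bin …

The single quotes matter: $ORIGIN has to reach the linker literally, so that the dynamic loader expands it at run time to the directory containing the binary.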

Next, I checked the /usr/bin/firefox start script, which I already quoted in the other post:

#!/bin/sh
export LD_LIBRARY_PATH="/usr/lib64/mozilla-firefox"
exec "/usr/lib64/mozilla-firefox"/firefox "$@"

Let's ignore the problem with the rewriting of the environment variable, which I don't care about right now, and check what it does. It adds the /usr/lib64/mozilla-firefox directory to the list of paths to load libraries from. And since it's setting LD_LIBRARY_PATH, all the library resolutions will have to be done manually rather than through the ld.so.cache file. So I checked which libraries it loads from there:

flame@yamato ~ % LD_LIBRARY_PATH=/usr/lib64/mozilla-firefox ldd /usr/lib64/mozilla-firefox/firefox | grep mozilla-firefox
flame@yamato ~ % scanelf -E ET_DYN /usr/lib64/mozilla-firefox 
 TYPE   FILE 
ET_DYN /usr/lib64/mozilla-firefox/libjemalloc.so 

(The second command finds all the libraries in the given path, by checking for ET_DYN, dynamic ELF, files.)

Okay, so there is one library, but it's not in the NEEDED lines of the firefox executable. Indeed, that library is a preloadable library with a different malloc() implementation (remember, I've written about similar things and commented on FreeBSD's solution), which means it has to be passed through LD_PRELOAD to be useful, and I can't see it being used at all. Indeed, if I check the loaded libraries of my firefox process, I can't find it:

flame@yamato x86 % fgrep jemalloc /proc/`pidof firefox`/smaps
flame@yamato x86 % 
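
For it to take effect at all, the start script would have to export it explicitly; something along these lines is what I would have expected to find (a sketch, not what the ebuild actually does):

#!/bin/sh
export LD_PRELOAD="/usr/lib64/mozilla-firefox/libjemalloc.so"
exec "/usr/lib64/mozilla-firefox"/firefox "$@"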

Let's go step by step, though; for now we can say with enough safety that the start script is overwriting LD_LIBRARY_PATH for no apparent good reason. Which libraries does the firefox executable load, then?

flame@yamato ~ % LD_LIBRARY_PATH=/usr/lib64/mozilla-firefox ldd /usr/lib64/mozilla-firefox/firefox
    linux-vdso.so.1 =>  (0x00007fffcabfd000)
    libdl.so.2 => /lib/libdl.so.2 (0x00007fa5c2647000)
    libstdc++.so.6 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.3.3/libstdc++.so.6 (0x00007fa5c2338000)
    libc.so.6 => /lib/libc.so.6 (0x00007fa5c1fc5000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fa5c284b000)
    libm.so.6 => /lib/libm.so.6 (0x00007fa5c1d40000)
    libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00007fa5c1b28000)
flame@yamato ~ % scanelf -n /usr/lib64/mozilla-firefox/firefox 
 TYPE   NEEDED FILE 
ET_EXEC libdl.so.2,libstdc++.so.6,libc.so.6 /usr/lib64/mozilla-firefox/firefox 

It can't be right, can it? We know that Firefox loads GTK+ and a bunch of other libraries, starting with xulrunner itself, but there is no link to those. But if you know your linker, you should notice a funny thing: libdl.so.2. It means the executable is calling into the loader at runtime, which usually means dlopen() is used. Indeed, it seems like the firefox executable loads the actual browser at runtime, as you can see by checking the smaps file.
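
If you want to double-check the dlopen() guess without poking at a live process, the dynamic symbol table makes it explicit; I'd expect something like this:

flame@yamato ~ % nm -D /usr/lib64/mozilla-firefox/firefox | grep dlopen
                 U dlopen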

Now there are two things to say here. There is a reason why firefox would be doing this: calling firefox while a copy is already open should actually request a new window to be opened, rather than start a new process. So basically I expect the executable to contain a launcher that, if a copy of firefox is running already, just tells it to open a new window, and otherwise loads all the libraries and the rest. It's a good idea from one point of view, because initialising all the graphical and rendering libraries just to tell another process to open a window would be a waste of resources. On the other hand, dlopen() is not the best-performing approach, and it also creates problems for prelink.

I have no idea why it happens, but the binary package as released by upstream provides a script that seems to take care of the launching, and then a firefox-bin executable that doesn't use dlopen() to load the Gecko engine and all the graphical user interface. I would very much like to know why we don't do the same for from-source builds; I would sincerely expect the results to be even better when using prelink and the like.

Now, let's return for a moment to the problem of SQLite3 being loaded twice in the binary release of Firefox; surely the same wouldn't happen for the from-source version, would it? Check it for yourself:

flame@yamato x86 % fgrep sqlite /proc/`pidof firefox`/smaps
7fea6c8c2000-7fea6c935000 r-xp 00000000 fd:08 701632                     /usr/lib64/libsqlite3.so.0.8.6
7fea6c935000-7fea6cb35000 ---p 00073000 fd:08 701632                     /usr/lib64/libsqlite3.so.0.8.6
7fea6cb35000-7fea6cb36000 r--p 00073000 fd:08 701632                     /usr/lib64/libsqlite3.so.0.8.6
7fea6cb36000-7fea6cb38000 rw-p 00074000 fd:08 701632                     /usr/lib64/libsqlite3.so.0.8.6
7fea814dc000-7fea8154f000 r-xp 00000000 fd:08 24920                      /usr/lib64/xulrunner-1.9/libsqlite3.so
7fea8154f000-7fea8174f000 ---p 00073000 fd:08 24920                      /usr/lib64/xulrunner-1.9/libsqlite3.so
7fea8174f000-7fea81751000 r--p 00073000 fd:08 24920                      /usr/lib64/xulrunner-1.9/libsqlite3.so
7fea81751000-7fea81752000 rw-p 00075000 fd:08 24920                      /usr/lib64/xulrunner-1.9/libsqlite3.so

Yes, yes it does happen. So I have a process that is loading one library, and not a little one at that, for no good reason at all, when it could probably, at this point, use a single system SQLite library. I say that it could, because now I have enough evidence to support it: if the two libraries had different ABIs, then depending on which one the symbols resolve to, either xulrunner or NSS would be crashing down. Since ELF uses a flat namespace, the same symbol name cannot be resolved in two different libraries, and thus one of the two libraries using them would find them in the "wrong" copy. And no, before you ask, neither uses symbol versioning.
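
By the way, glibc's loader can show exactly which copy each symbol binds to, so a check along these lines would settle the question for good (the output is very verbose, hence the grep):

flame@yamato ~ % LD_DEBUG=bindings firefox 2>&1 | grep sqlite3_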

So at this point the question is: can both Firefox upstream and the Gentoo Firefox ebuild start providing something that does more than just work, and actually works properly?

On the road to Free Java – a story

After my post about the long road to Free Java, I tried asking everybody who might have a clue about it, and found out what the root cause of the problem was.

Basically, when IcedTea6 is built, it has to bootstrap itself, so it first builds itself with the JDK you provide (gcj-jdk) and then rebuilds itself with the just-built icedtea6; but for that rebuild, it sets the JAVA_HOME variable, hoping for ant to pick it up. By choice of the Gentoo Java team, though, the JAVA_HOME variable is neither supported nor respected, so the override fails, and it tries to build itself with the previous compiler, the wrong one.

How can this work for anybody then, like Andrew said? Well, the trick is in the keywords you use. On stable systems, ant-core-1.7.0-r3 from the Java overlay is picked up, which contains a hack from Andrew to make it respect JAVA_HOME (no, you cannot call it "the proper way", since it does not fix the comments; if your idea of a hack does not encompass making a change and leaving the now-contradictory comments in place, then I start to worry…). If you are on an unstable system, you're going to get ant-core-1.7.1 from the main tree, and that version does not have the hack, and thus will fail to build IcedTea6. I'm not sure where David Philippi has seen ant-core-1.7.1-r1 from java-overlay, since the overlay still has the old version.

So I decided that even if it does not conform to my usual QA strictness, I wanted to try out IcedTea6. The reason is that I'm addicted to Yahoo Games and I haven't found any free software package yet that supports playing Canasta online, for instance… and I was tired of using the laptop for that, since I have Yamato here all the time. I then disabled my --as-needed compiler (the build system fails to properly order the linking lines), installed the hacked ant-core, and merged icedtea6.

This time 1.3.1-r1 finally merged and I could try it out, good! about:plugins in Firefox shows me that it's picked up, but… once I get to the Yahoo Games page, it does not really work: the "table" window opens, but it does not load the applet; it times out and tries to reload, does so a few times, and then Yahoo tells you to disable popup blockers.

I tried a couple more applets along the line, but it still failed quite badly, crashing a couple of times. Yeah, we're on the road to Free Java, but we're certainly not there yet.

On the other hand, if somebody knows how to debug problems like the ones I described above, I'd be glad to provide more information to the IcedTea/OpenJDK developers, to see that they get resolved so we can finally have a working nsplugin on AMD64.

This is not the way to manage an Operating System

One would think that when you develop an Operating System, one of your objectives is to get it supported by as much software as you can, or at least that you'd care about it. One of the things you might try is to make it easy for the developers of that software to take care of the support for you.

Unfortunately, it seems like the FreeBSD people don't share my idea of how this should be done. A simple way to check this is by looking at the nspr problem I found yesterday.

I got an emerge -e world running on Defiant after the 6.2 update, and the only package to fail because of the update was dev-util/nspr. Why? Because the 6.2 series added a function (getprotobyname_r) that NSPR used to replace with an in-line copy. The patch is trivial, and I submitted it upstream in Mozilla bug #354305.

Now, just as an informative check, I wanted to see what ports did for that package, and went to the devel/nspr page… a patch similar to the one I submitted had been present in the port since July! Since version 4.6.1 of nspr. And it was never sent upstream.

Sigh.

By the way, this is yet another case where an autotools check beats preprocessor version tests: just AC_CHECK_FUNCS([getprotobyname_r]) and you'll be fine.
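
That is, a single line in configure.ac:

AC_CHECK_FUNCS([getprotobyname_r])

and then the in-line fallback gets wrapped in #ifndef HAVE_GETPROTOBYNAME_R … #endif, instead of trying to guess from the OS version. (A sketch of the approach, of course, not NSPR's actual build system.)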

Cannot sleep…

… well, not news: everybody who has followed my blog in the past months knows that I suffer from insomnia from time to time. Tonight I wasn't expecting it, tho.

So, as I'm here, I want to at least give a bit of an update on what I'm doing and why :P

First of all, last night new minor versions of libtorrent and rtorrent were released; you can find 0.10 and 0.6 in Portage already, and they seem to work well too. After this release, I also asked for stable on the 0.9.3/0.5.3 versions, and deleted all the other lower versions but the current stable, so that I can reduce the number of ebuilds in CVS and at the same time provide suitable stable ebuilds (rtorrent and libtorrent have proved completely stable for me up to now).

Today I also revbumped (again, I know) vlc to fix a problem with the newly re-added nsplugin; now it should work fine. I also took the chance to change one "little not so little" thing: FFmpeg is now mandatory, no more ffmpeg USE flag, as VLC is kinda pointless without it. Also, the next time I bump xine-lib I'll make external FFmpeg mandatory, as it's proving stable and I can make sure the version in Portage always works with the xine-lib version we have.

You might also see that I’m trying to fix dependencies on virtual/x11 to ‘<virtual/x11-7’, as suggested by Jakub, to try to avoid problems when users have virtual/x11-7 merged (luckily, I don’t ;).

On the Gentoo/FreeBSD front, I gave up on dbus for now; I'll work on it when I have more time. I also gave up on lua, because the current versions are evil and they have been masked for soooo long that I don't think it's worth spending time on them. I keyworded Firefox 1.5 and SeaMonkey tho (and is it just me who finds the monolithic SeaMonkey more performant than Firefox, which was created to be faster than the old monolithic Mozilla? If I'm not the only one, I can start to say Mozilla people screwed up big time with that).

The ruby-hunspell bindings are working, but I haven't written any documentation about them yet, so they are still lingering around in GIT rather than being released. I hope to do so soon. I should also update the website, and that's a bit annoying from my point of view.

I’m thinking of updating the stage for Gentoo/FreeBSD too, as there are quite a few things that changed, starting from GCC and Binutils versions, that are now respectively 4.1.1 and 2.17. I should really do that.
But I’m having trouble with the crosscompiler right now and this ain’t good at all, so I have to resolve that first (so that building the stage won’t take two days).

The last thing I'll say is that next month I'll probably take two weeks off, for basically the first time since I joined Gentoo last year! Not a big deal, no holidays, nothing really relaxing; I'll just go to stay at my sister's place for two weeks while her husband is out of town for his job, but I'll be unavailable for those two weeks as I'm not going to have a net connection or anything else. I'll be reading my books, and trying to apply the Qigong exercises I read about, which are not easy to actually do daily as long as I'm at home.

And there it is…

Okay, I've committed vlc 0.8.5-r4 to Portage, with the restored nsplugin USE flag and a new seamonkey USE flag. By default it links against Firefox, and that's it.
If you enable the seamonkey USE flag, it links against SeaMonkey instead, of course. The patch I've prepared does not apply to the current SVN code, but I do have one that applies; I just need to clean it up a bit and then send it upstream (I already showed it to xtophe on #videolan).

Now, just to let people know, the hard part begins: I'm not a Mozilla user. I use Konqueror daily, I rarely use Camino on the iBook, but mostly I use Safari there too (when I'm on OS X, that is, else I use Konqueror), and sometimes I use Opera if I need 32-bit support.

This means that I won't be able to help if there are problems with the plugin itself; those will probably be marked as UPSTREAM, either for VLC or for Mozilla… or I will ask someone else to co-maintain it. So if any dev would like to help out with VLC, I'm ready to hand it over, even entirely if needed ;)

And now to proceed with my Radeon adventure: this box now seems to run 12°C cooler than the other day, and I'd be ready to bet the card change is responsible. Not a bad thing at all!

Okay, let’s see what else I have on my TO DO list for tonight. Sigh.