For A Parallel World. Home Exercise n.1: a drop-in dynamic replacement for memcpy()

Since I’ve written about OpenMP I’ve been trying to find time to test it on real usage scenarios; unfortunately between health being far from optimal the past few days with general aches, and work piling up, I haven’t been able to get to work on it at all. It would be much nicer if I could get a job that would allow me to spend time on these things, but I don’t want to rant about that since I’m happy to have jobs from time to time, as it is.

Yesterday I toyed around a bit with OpenMP and xine-lib: I wanted to try implementing a memcpy() replacement that could use OpenMP to work in parallel, which would be especially useful for multimedia applications. Besides some issues with autotools and OpenMP, which I’m going to address in a future post, I ended up with a few more things on my mind (the usual problem with trying out new things: you want to achieve one result, and you get material for three other tests; now I know why Mythbusters starts with one idea and then ends up doing four or five similar tests).

My parallel memcpy() replacement was just as lame as my earlier byteswapping attempt: a single for loop parallelised with the proper pragma. Just to make it not too lame, I used 64-bit copies (unaligned, but I would expect that not to matter on x86-64 at least; it was just a test). The reason I didn’t go for a less lame method is that a second test on byteswapping, which I haven’t had time to write about yet, showed that using more complex tricks does not really help. While splitting the memory area to swap into X equally-sized areas, with X being the maximum number of threads OpenMP is managing (identified dynamically), caused a slight improvement on the huge areas (half a gigabyte and a full gigabyte), it made no acceptable difference (considering the cost of the more complex code) on smaller blocks, and I really doubt such huge memory areas would ever be swapped all at once. Splitting the area into page-sized (4KiB) blocks actually made the code slower. My guess, since I didn’t dig deeper to check, is that the threads usually all execute on the same core, or at least on the same CPU, which means the pages are all mapped on the memory directly connected to that CPU; splitting the work into pages might make it bounce between the two different CPUs and thus make it slower. I’ll look more deeply into that when I have time.

Unfortunately, using this particular memcpy() implementation didn’t let me start xine properly, I probably made a mistake, maybe unaligned 64-bit copies on x86-64 don’t work, just like on any other architecture, but I didn’t go around trying to fix that for the very same reason why I’m writing this post.
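For reference, the kind of naive replacement I’m talking about is little more than the following sketch (the function name and the byte-sized tail handling are mine, not actual xine code):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Naive OpenMP memcpy() replacement: a single for loop, parallelised
 * with a pragma, copying unaligned 64-bit words; the tail is copied
 * with a plain memcpy(). Hypothetical sketch, not xine's code. */
void *omp_memcpy(void *to, const void *from, size_t len) {
  uint64_t *dst = to;
  const uint64_t *src = from;
  const size_t words = len / sizeof(uint64_t);
  size_t i;

  /* the pragma is simply ignored when not building with -fopenmp */
  #pragma omp parallel for
  for (i = 0; i < words; i++)
    dst[i] = src[i];

  /* copy the remaining tail bytes sequentially */
  memcpy((char *)to + words * sizeof(uint64_t),
         (const char *)from + words * sizeof(uint64_t),
         len % sizeof(uint64_t));
  return to;
}
```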

It turns out that xine, just like MPlayer and other multimedia applications, carries its own implementations of “fast memcpy()”, using SIMD instructions (MMX, MMXEXT, SSE, AltiVec, …). At runtime they benchmark which one gives the best result (on my system it’s either the Linux kernel implementation, not sure which version, or the MMX version), and then use that. This has some problems that are obvious, and some that are much less obvious. The first problem is that the various implementations have to take care of similar issues, which causes code duplication (handling of small memory area copies, handling of unaligned copies and so on). The second is much more subtle, and it’s what I think is the main issue to be solved.

When a programmer uses functions like memcpy(), strlen() and others in a C program, the compilation process (with optimisations) will hit some particular code called “builtins”: the compiler knows how to deal with these functions, and emits different machine code depending on the way each one is called. This usually happens when the parameters to the call are known at build time, because they are either constant or can be derived (for static functions) from the way the function is called. How this affects mathematical functions and functions like strlen() can be better understood by reading an old article of mine; as for memcpy(), I’ll try to be brief and explain it here.

Let’s take a very easy function that copies an opaque type that is, in all truth, a 64-bit data field:

#include <string.h>

void copy_opaque(void *to, void *from) {
  memcpy(to, from, 8);
}

Building this code on x86-64 with GCC 4.3 and no optimisation enabled will produce this code:

copy_opaque:
        pushq   %rbp
        movq    %rsp, %rbp
        subq    $16, %rsp
        movq    %rdi, -8(%rbp)
        movq    %rsi, -16(%rbp)
        movq    -16(%rbp), %rsi
        movq    -8(%rbp), %rdi
        movl    $8, %edx
        call    memcpy
        leave
        ret

As you can see there is a call to memcpy() after setting up the parameters, just like one would expect. But turn on the optimisation with -O2 and the resulting code is quite different:

copy_opaque:
        movq    (%rsi), %rax
        movq    %rax, (%rdi)
        ret

The function has been reduced to two instructions, plus the return, with no stack usage. This is because the compiler knows that for 64-bit copies it can just emit straight memory accesses and simplify the code quite a bit. The memcpy() function is not a static inline, but the compiler knows its interface and can produce optimised code just fine thanks to the builtin. Similarly, when using -O2 -fno-builtin to ask the compiler not to use its builtins knowledge, for instance because you’re using special access functions, you can see that the resulting code is still composed of two instructions, but of a different type:

copy_opaque:
        movl    $8, %edx
        jmp     memcpy

Let’s go back to the builtin though, since that’s what’s important to know before I can explain why the dynamically-chosen implementation in xine and company is quite suboptimal.

When you change the size of the memory area to copy in copy_opaque() from 8 to a different constant, you can see that the code changes accordingly. If you use a number that is not a multiple of 8 (that is the biggest size that x86-64 can deal with without SIMD), you can see that the “tail” of the area is copied using smaller move operations, but it’s still expanded. If you compare the output with multiple power-of-two values, you can see that up to 128 it inlines multiple movq instructions, while starting with 256, it uses rep movsq. With very big values, like (1 << 20), the compiler emits a straight memcpy() call. This is because the compiler can assess the overhead of the call and decide when it’s big enough to use a function rather than inlining code.

It can also decide this based on what type of optimisation is requested, for instance I said above that rep movsq starts to get used after the value 256 (1 << 8), but that was intended with the -O2 level; with -Os, it’s already when you have more than two 64-bit words.

Since library functions like memcpy() and similar are very widely used, the fact that the compiler can emit much simpler code for them is very useful. But this works only as long as the compiler knows about them. As I said, turning off the builtin replacement will cause the code to be compiled “literally”, with a call to the function, which might have a much higher overhead than a straight copy. Now it should be quite a bit easier to grasp what the problem is with the dynamic memcpy() replacement used by xine and other software.

Let’s change the code above to something like this:

#include <string.h>

extern void *fast_memcpy(void *to, void *from, size_t foo);

void copy_opaque(void *to, void *from) {
  fast_memcpy(to, from, 8);
}

Now even turning on the optimisations won’t make any difference: the compiler will always emit a call to fast_memcpy():

copy_opaque:
        movl    $8, %edx
        jmp     fast_memcpy

As you might guess, this is certainly slower than the straight copy that we had before, even if the memcpy() replacement is blazing fast. The jump will also require a symbol resolution, since fast_memcpy() is not statically defined, so it’ll have to pass through the PLT (Procedure Linkage Table), which is an expensive operation. Even if the symbol were defined internally to the same library, the call would still most likely go through the GOT (Global Offset Table) in shared objects.

By redefining the memcpy() function, xine and others are actually slowing the code down, at least when the size of the copy is known, a constant at build time. GCC extensions actually allow you to define a macro, or even better a static inline function, that can discern whether a compile-time constant is used, and then fall back to the original memcpy() call, which the compiler will mangle as it prefers; but this is quite complex, and in my opinion not worth the bother.
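Such a wrapper could look something like this (a sketch only; the smart_memcpy name and the trivial fast_memcpy stand-in are mine):

```c
#include <string.h>

void *fast_memcpy(void *to, const void *from, size_t len);

/* Dispatch to the compiler's builtin when the length is a
 * compile-time constant, and to the runtime-selected routine
 * otherwise. GCC-specific because of __builtin_constant_p(). */
#define smart_memcpy(to, from, len)              \
  (__builtin_constant_p(len)                     \
     ? memcpy((to), (from), (len))               \
     : fast_memcpy((to), (from), (len)))

/* trivial stand-in so the sketch is self-contained; a real one
 * would be the SIMD implementation chosen at runtime */
void *fast_memcpy(void *to, const void *from, size_t len) {
  return memcpy(to, from, len);
}
```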

Why do I say this? Well the first issue is that sometimes even if a value is not properly a constant at build-time, the compiler can find some particular code path where the function can be replaced, and thus emit adaptive code. The second is that you might just as well always use properly optimised memcpy() functions when needed, and if the C library does not provide anything as fast, you just need to use the Force ELF.

When the C library does not provide functions optimised for your architecture, for compatibility or any other reason, you can try to replace them through the ELF feature called symbol interposing, which basically works in the same way as symbol collisions (I have some “slides” I’ve been working on for a while on the subject, but I’ll talk more extensively about this in a future post), and allows you to intercept or replace calls to C library functions. It’s the same method used to implement the sandbox used by Portage, or the OSS wrappers for ALSA, PulseAudio, ESounD and so on.
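As a minimal sketch of the mechanism (my own example, not the actual sandbox or wrapper code), a shared object like the following, loaded through LD_PRELOAD, replaces the C library’s memcpy() for the whole process:

```c
#include <stddef.h>

/* Build with: gcc -shared -fPIC -o libinterpose.so interpose.c
 * Run with:   LD_PRELOAD=./libinterpose.so some_program
 * The dynamic linker resolves memcpy() to this definition before
 * the C library's one. A real replacement would of course dispatch
 * to an optimised implementation instead of a byte loop. */
void *memcpy(void *to, const void *from, size_t len) {
  char *d = to;
  const char *s = from;
  while (len-- > 0)
    *d++ = *s++;
  return to;
}
```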

What I’d like to see is a configurable library that would allow choosing between different memcpy() implementations, maybe on a per-size basis too, parallel and non-parallel, at runtime, through a file in /etc. This is quite possible, and similar features, with replacements for many common library functions, are available with freevec (which unfortunately only implements AltiVec on 32-bit PPC). But a more arch-neutral way to handle this would most likely help out.

Anyway, if somebody is up to take the challenge, I’d be glad to test, and to add to the tinderbox to test on the packages’ testsuites too. Let’s see what happens.

Why foreign language bindings shouldn’t live with their parents

Gentoo users probably know better than users of other distributions the pain that comes with foreign language bindings (C++, Python, Ruby, …) for various libraries. But these pains are often caused by what I think is a mistake in package design: letting the foreign language bindings “live with their parents”, that is, with the original library that the bindings are bound to.

The first issue here is how we implement these cases in Gentoo: this is usually done through USE flags with the name of the language, like ruby, python, cxx, which enable building and installing the bindings; this also means that if a piece of software is written using those bindings it has to use EAPI 2 USE dependencies or built_with_use checks. This is certainly not optimal, but I cannot blame upstream for the way we have to handle it, can I?

What I think the problem is, is that by having bindings and original library in the same package you’re forcing the same release cycle on two products whose lives should be mostly independent. If you have to fix a nasty security bug in the underlying library, you don’t really need to release or rebuild the actual bindings. If you need to release new bindings to go with a new language version, for things like Python and Ruby, you most likely do not have to release a new version of the underlying library.

In general, you can expect that as long as the interface of the underlying library is stable, the bindings can remain the same even after ten bugfix releases of the library; or, if the library is stable by itself and the bindings aren’t, the bindings can be released ten times for a single library release. This goes a long way toward reducing the amount of code that a single release affects.

The reason for reducing the amount of code a single release affects is not just that source-based distributions like Gentoo want to reduce the build time for their users; it also makes it much easier for the various distributions to test the code, and ensures that unrelated changes don’t get merged in together with important fixes. This is especially useful for distributions with stable branches, like Debian and Gentoo.

So please, when you write a library and want to add further bindings, create further packages, pretty please!

Why strcasecmp() and similar functions should not be used when parsing data

It might sound obvious to most experienced programmers, but it certainly is not obvious to many others, which I’m afraid is a very bad thing, since I’d really like to expect people who write code to understand at least a little bit of the logic behind it.

I’m not going to talk about the problems regarding case-insensitive comparison and locale settings (just remember that i and I are not the same character in Turkish), which I still expect most developers to ignore; that is totally beside the point here, though, and they are justified by not being linguists (unless they are Turks, in which case I’d worry).

What I’m talking about is the logic behind the comparison at all. In a normal string comparison you have a very easy workflow: each character of the two strings is compared in turn, you drop out at the first one that differs, and you finish when both strings reach the end. When you want to compare two strings case-independently, the comparison cannot just happen over the characters as they are; they first have to be brought to the same case.

To achieve that you have many different options: look up equivalence tables (up to 256 by 256 entries for 8-bit characters), look up case-changing tables (twice), check whether the character is in a given range, and so on. At any rate, it’s much more work than a simple comparison.

You can expect the library you’re using to be optimised enough that the comparison does not take too long, so using strcasecmp() for a one-shot comparison is fine. What is not fine, though, is doing parsing with it: taking some token out of a file, and then comparing it case-insensitively against a series of known tokens. That’s a no-no, since you’re going to require lookups or transformations many times in a row.

The easy way out of this is to ensure that all the reference tokens have a given case (lowercase or uppercase does not matter), and then convert the read token to the same case, so that you can just use the standard, fast, and absolutely non-complex case-sensitive string comparison.
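A sketch of what I mean (the token names here are made up for the example):

```c
#include <ctype.h>
#include <string.h>

/* Lowercase the token once... */
static void str_tolower(char *s) {
  for (; *s != '\0'; s++)
    *s = (char)tolower((unsigned char)*s);
}

/* ...then every comparison is a plain, fast strcmp() against
 * lowercase reference tokens, instead of repeated strcasecmp(). */
int classify_token(char *token) {
  str_tolower(token);
  if (strcmp(token, "title") == 0)
    return 1;
  if (strcmp(token, "artist") == 0)
    return 2;
  if (strcmp(token, "album") == 0)
    return 3;
  return -1;
}
```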

It’s not that difficult, is it?

Update (2017-04-28): I feel very sad to have found out over a year and a half later that Michael died. The links in this and other posts to his blog are now linked to the archive kindly provided and set up by Jan Kučera. Thank you, Jan. And thank you, Michael.

Macros versus static inline

In my post about glib byte swapping functions I pointed out the inconsistent behaviour of glib’s macros for byteswapping when it comes down to argument evaluation. If you look at the comments on that post, you can see that Paul points out that a very useful way to do the same thing, without the problem of inconsistencies, is simply to use inline functions rather than macros. I agree with him that it’s a much more useful thing to do.

Indeed, static inline functions, when the compiler supports them properly (which means, not with the Sun Studio compiler), are quite a bit better than macros: their arguments are evaluated just once, always, consistently; they have their own scope for variable names, so you can’t introduce a mystifying error in your code by using the same name as a local variable inside a macro; they leave the compiler free to decide what the best course of action is for them, for instance in a loop; and so on. Additionally, they make debugging a little easier than macros: when I fixed my problem with scanelf I had a segmentation fault inside a call to a 17-line macro. Not exactly the nicest thing to debug (and it actually made me quite wary of scanelf; I’m starting not to like it too much, considering the complexity of the code), as you might guess.
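To make the difference concrete, here is an illustrative 16-bit byteswap in both forms (my own example, not glib’s actual code). Passing an expression with side effects, like BSWAP16_MACRO(*p++), gets it evaluated twice in the macro version, while the inline function is always safe:

```c
#include <stdint.h>

/* macro version: the argument is textually pasted in twice */
#define BSWAP16_MACRO(x) ((uint16_t)(((x) >> 8) | ((x) << 8)))

/* static inline version: the argument is evaluated exactly once,
 * and 'x' cannot clash with the caller's variable names */
static inline uint16_t bswap16(uint16_t x) {
  return (uint16_t)((x >> 8) | (x << 8));
}
```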

So at the end I decided to go for it, and I submitted a patch to glib that replaces all the byteswapping macros with static inline functions; the code emitted is the same with GCC 4.3, but they are in my opinion more readable and have a better chance of working with non-GNU compilers (the previous implementation for those used the __extension__ keyword, which as you might guess is a GNU extension, while static inline is a standard C99 feature).

Hopefully the patch will be merged soon, and at least the newer versions of glib will have a behaviour much more consistent with other libraries too. Now, if I can get to actually benchmark base64 encoding and decoding, and digesting through MD5 and SHA1, comparing glib with libavutil, I’d know what to look at for improvement.

Ruby-Elf and multiple compilers

I’ve written about supporting multiple compilers; I’ve written about testing stuff on OpenSolaris and I have written about wanting to support Sun extensions in Ruby-Elf.

Today I wish to write about the way I’m currently losing my head trying to get the Ruby-Elf testsuite to apply to compilers other than GCC. Since I’ve been implementing some new features for missingstatic upon request (by Emanuele “exg”), I decided to add some more tests, in particular considering the Sun and Intel compilers, which I decided to support for FFmpeg at least.

The new tests not only apply the already-present generic ELF tests (rewritten and improved, so that I can extend them much more quickly) to files built with ICC and Sun Studio under Linux/AMD64, but also add tests to check the nm(1)-like code against a catalogue of different symbols.

The results are interesting in my view:

  • Sun Studio does not generate .data.rel sections, it only generates a single .picdata section, which is not divided between read-only and read-write (which might have bad results with prelinking);
  • Sun Studio also emits uninitialised non-static TLS variables as common symbols rather than in .tbss (this sounds like a mistake to me sincerely!);
  • the Intel C Compiler enables optimisation by default;
  • it also optimises out unused static symbols with -O0;
  • and even with __attribute__((used)) it optimises out static uninitialised variables (both TLS and non-TLS);
  • oh, and it puts a “.0” suffix on the name of unit-static data symbols (I guess to discern between them and function-static symbols, which usually have a numeric code after them);
  • and last but not least: ICC does not emit a .data.rel section, nor a .picdata section: everything is emitted in the .data section. This means that if you’re building something with ICC and expect cowstats to work on it, you’re out of luck; but it’s not just that: it also means that prelinking will not help you at all to reduce memory usage, just a bit to reduce startup time.

Fixing up some stuff for Sun Studio was easy, and now cowstats will work fine even on source code compiled with Sun Studio; taking care of ICC’s quirks was not as easy, and also meant wasting quite some time.

On the other hand, there is one new feature in missingstatic: it now shows the nm(1)-like symbol code next to the symbols that are identified as missing the static modifier; this way you can tell whether each is a function, a constant or a variable.

And of course, there are two manpages: missingstatic(1) and cowstats(1) (DocBook 5 rulez!) that describe the options and some of the workings of the two tools; hopefully I’ll write more documentation in the next weeks, and that’ll help Ruby-Elf get accepted and used. Once I have enough documentation about it I might actually decide to release something. I’m also considering the idea of routing --help to man, like git commands do.

Between Mono and Java

Some time ago I expressed my feelings about C#; to sum them up, I think it’s a nice language, by itself. It’s near enough to C to be understandable by most developers who have ever worked with C or C++, and it’s much saner than C++ in my opinion.

But I haven’t said much about Mono, even though I’ve been running GNOME for a while now and of course I’ve been using F-spot and, as Tante suggested, gnome-do.

I’ve been thinking about writing something about this since he also posted about Mono, but I think today is the best day of all, as there has been some interesting news in Java land.

While I do see that Mono has improved hugely since I last tried it (for Beagle), I do still have some reserves against Mono/.NET when compared with Java.

The reason for this is not that I think Mono cannot improve or that Java is technically superior, it’s more that I’m glad Sun finally covered the Java crap. OpenJDK was a very good step further, as it opened most of the important parts of the source code for others. But it also became more interesting in the last few days.

First, Sun accepted the FreeBSD port of their JDK into OpenJDK (which is a very good thing for the Gentoo/FreeBSD project!), and then a Darwin port was merged in OpenJDK. Lovely: Sun is taking the right steps to come out of the crap.

In particular, the porters project is something I would have liked to get involved in, if it wasn’t for last year’s health disaster.

In general, I think Java now has a much better chance of becoming the true high-level multiplatform language and environment, over C# and Mono. This is because the main implementation is open, rather than having one (or more) open implementations trying to track down the first and main implementation.

But I’d be seriously interested in a C# compiler that didn’t need the Mono runtime, kind of like Vala.

System headers and compiler warnings

I wish to thank Emanuele (exg) for discussing this problem with me last night; I thought a bit about it, checked xine-lib in this regard, and then decided to write something.

Not everybody might know this, but GCC, albeit reporting tons of useful warnings, especially in newer versions, together with a few that can easily get annoying and are rarely useful, like pointer-sign warnings, ignores system headers when doing its magic.

What does this mean? Well, when a system library installs a header that would trigger warnings, those warnings are ignored by default. This is useful because, while you’re working on the application foo, you won’t care what glibc developers did. On the other hand, sometimes these warnings are useful.

What Emanuele found hidden by the suppression of system-header warnings was a redefinition of a termios value in Emacs for Darwin (OS X). I checked for similar issues in xine-lib and found a few that I’ll have to fix soonish.

I’m not sure how GCC handles the warning suppression; I think it’s worth opening a bug for them to change their behaviour here, though, as the redefinition is a warning caused by the program’s code, not by the system headers.

Now of course I can hear the KDE users thinking “but I do get warnings from the system headers”, in reference to KDE’s headers. Well, yes:

In file included from /usr/kde/3.5/include/kaboutdata.h:24,
                 from main.cpp:17:
/usr/qt/3/include/qimage.h: In member function ‘bool QImageTextKeyLang::operator<(const QImageTextKeyLang&) const':
/usr/qt/3/include/qimage.h:58: warning: suggest parentheses around && within ||

I took this warning from a yakuake build, but it’s the same for every KDE package you merge, more or less. It’s a warning caused by an include, a library include, but in general the same rules apply to those.

Why is it not suppressed? The problem is in how the inclusion of the path happens. Which is probably my main beef against system headers warning suppression: it’s inconsistent.

By default the includes in /usr/include (and thus found without adding any -I switch) get their warnings suppressed. If a library (say, libmad) installs its headers there, it will get its warnings suppressed.

On the other hand, if a library installs its headers in an alternative path, like /usr/qt/3/include in the example above, or a more common /usr/include/foobar, then it depends on how that directory is added to the list of searched directories. If it’s added through -I (almost every case) its warnings will be kept; they would be suppressed only if you used -isystem instead. Almost no software uses that option, which, as far as I know, is GCC-specific.

So whether a library will have the warnings caused by its headers suppressed or not, depends on the path. Nice uh? I don’t think so.

More work!

Sub-optimal optimisations?

While writing Implications of pure and constant functions I’ve been testing some code that I was expecting to be optimised by GCC. I was surprised to find a lot of my testcases were not optimised at all.

I’m sincerely not sure whether these are due to errors on GCC, to me expecting the compiler to be smarter than it can feasibly be right now, or to the “optimised” code to be more expensive than the code that is actually being generated.

Take for instance this code:

int somepurefunction(char *str, int n)
  __attribute__((pure));

#define NUMTYPE1 12
#define NUMTYPE2 15
#define NUMTYPE3 12

int testfunction(char *param, int type) {
  switch(type) {
  case 1:
    return somepurefunction(param, NUMTYPE1);
  case 2:
    return somepurefunction(param, NUMTYPE2);
  case 3:
    return somepurefunction(param, NUMTYPE3);
  }

  return -1;
}

I was expecting in this case the compiler to identify cases 1 and 3 as identical (by coincidence) and merge them into a single branch. This would have made debugging quite hard, actually (as you wouldn’t be able to discern the two cases), but it’s a nice reduction in code, I think. Neither on x86-64 nor on Blackfin, with either GCC 4.2 or 4.3, are the two cases actually merged; the duplicated code is left in there.

Another piece of code that wasn’t optimised as I was expecting it to be is this:

unsigned long my_strlen(const char *str)
  __attribute__((pure));
char *strlcpy(char *dst, const char *str, unsigned long len);

char title[20];
#define TITLE_CODE 1
char artist[20];
#define ARTIST_CODE 2

#define MIN(a, b) ( a < b ? a : b )

static void set_title(const char *str) {
  strlcpy(title, str, MIN(sizeof(title), my_strlen(str)));
}

static void set_artist(const char *str) {
  strlcpy(artist, str, MIN(sizeof(artist), my_strlen(str)));
}

int set_metadata(const char *str, int code) {
  switch(code) {
  case TITLE_CODE:
    set_title(str);
    break;
  case ARTIST_CODE:
    set_artist(str);
    break;
  default:
    return -1;
  }

  return 0;
}

I was expecting here a single call to my_strlen(), as it’s a pure function, and in both branches it’s the first call. I know it’s probably complex code once unrolled, but still, GCC at least was better at this than Intel’s and Sun’s compilers!

Both Intel’s and Sun’s, even at -O3 level, emit four calls to my_strlen(), as they can’t even optimise the ternary operation! Actually, Sun’s compiler comes last for optimisation, as it doesn’t even inline set_title() and set_artist().

Now, I haven’t tried IBM’s PowerPC compiler as I don’t have a PowerPC box to develop on anymore (although I would think a bit about the YDL PowerStation, given enough job income in the next months — and given Gentoo being able to run on it), so I can’t say anything about that, but for these smaller cases, I think GCC is beating other proprietary compilers under Linux.

I could have checked Microsoft’s and CodeGear’s (formerly Borland’s) compilers, but that was a bit out of my scope at the moment.

If I thought a bit before about supporting non-GNU compilers for stuff like xine and unieject, I’m starting to think it’s not really worth the time spent on that at all, if this is the result of their compilations…

And yet again, I miss Borland’s compiler

Don’t get the title wrong, I like GCC, but there are a few things that don’t trigger a warning in GCC, but do on Borland’s, which are quite useful and important.

The main thing I miss is the warning that Borland’s compiler gives you when a variable is given a value that is never used. As I wrote more than a week ago, GCC does not warn you about unused variables if they are assigned a value after they are declared. Which in xine tends to happen quite a few times.

This is pretty important because, even if GCC is good enough not to emit the variable if it’s not used, if the assigned value is the return value of a function, the function call is unlikely to be optimised away. Pure and constant functions should be optimised away, but for functions the compiler has no clue about (which is the common status for non-static functions unless you tell it otherwise) the call is still executed, as it might change the global state variables. If the call is expensive, it would be a waste of CPU.
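A sketch of the case in question (the names are mine): GCC warns about a variable that is never touched at all, but stays silent when the variable is assigned after declaration and never read, even though the call producing the value is still executed:

```c
static int calls = 0;

/* stand-in for an expensive function the compiler has no clue about;
 * made static here only to keep the example self-contained */
static int expensive(void) {
  calls++;
  return 7;
}

int demo(void) {
  int b;            /* no -Wunused-variable warning for 'b'... */
  b = expensive();  /* ...because this assignment counts as a use;
                       the call is still executed regardless */
  return 42;
}
```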

So I first tried ICC, remembering it used to have nicer and stricter warnings than GCC. Unfortunately even after installing it, getting a license key and opening a new shell with the environment set up, I get this:

/usr/include/stdlib.h(140): error: identifier "size_t" is undefined
  extern size_t __ctype_get_mb_cur_max (void) __THROW __wur;

As you can guess, it’s not very nice that size_t results undefined, and indeed it can’t even complete the ./configure run.

Then I decided to try Sun’s compiler. I remembered Donnie having an ebuild for Sun Studio in his overlay, so I downloaded that and installed it. I had to fix xine’s build system a bit, because Sun’s compiler was detected for PThread support only under Solaris, while of course you can use Sun’s compiler under Linux too.

After completing the ./configure run properly, I’ve started seeing issues with xine’s code.. well I expected that. Mostly, the short form of the ternary operation (foo ? : bar, which is equivalent to foo ? foo : bar but with a single evaluation of foo) is not supported – I suppose it’s a GNU extension – but that’s not difficult to fix by avoiding that form…
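For instance (an illustrative rewrite, not the actual xine change), the GNU form can be replaced while keeping the single evaluation by introducing a temporary:

```c
#include <stddef.h>

static const char *result;                          /* set by the test */
static const char *lookup(void) { return result; }  /* stands in for an
                                                       expensive call */

/* portable equivalent of the GNU extension:
 *   return lookup() ?: fallback;
 * which Sun Studio rejects */
const char *name(const char *fallback) {
  const char *tmp = lookup();   /* evaluated exactly once */
  return tmp ? tmp : fallback;
}
```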

The problems started the moment it compiled the first source file for xine-lib itself (rather than its tools):

c99: Warning: illegal option -fvisibility=hidden
"../../../src/xine-engine/xine.c", line 83: internal compiler error: Wasted space
c99: acomp failed for ../../../src/xine-engine/xine.c

Now with all the good will I have, what should “Wasted space” mean to me‽

The illegal option is also a nice thing to see, considering that I test for it during the ./configure phase, and Sun’s compiler answers a lot as if it worked:

configure:49543: checking if compiler supports -fvisibility=hidden
configure:49560: sunc99 -c -features=extensions -errwarn=%all -fvisibility=hidden  conftest.c >&5
c99: Warning: illegal option -fvisibility=hidden
configure:49567: $? = 0
configure:49584: result: yes

Sincerely, I’ve been thinking a lot lately when I read about Sun wanting the good of Free Software. I’ve had a few people telling me that xine lacks support for Solaris, the Sun Studio compiler, the UltraSPARC architecture, … Well, it’s not like those are easy to support: Solaris for x86 is quite slow, and wasn’t working under VirtualBox for a while (it should work now, but I haven’t had time to look at it yet); Sun Studio for Linux fails, as I just noted; and the only way for a standalone developer to get a decent Sun system is to look hopefully at second-hand offers on eBay and the like (a T2 basic server costs about $15K, a bit out of my league just for optimising xine, and as far as I can see all their workstations are now AMD64-based, or x64 as they call it, but I hate that marketing name as it really means nothing).

Maybe they are just interested in enterprise Free Software, but still… I sincerely think they have the right cards to make some difference, but I can’t see much Free Software development, besides the usual enterprise kind, going on with Sun systems in the near future. Which is a bit sad, considering I’ve seen my Ultra5 outpowering an Athlon running at almost twice its frequency…

About code style and code execution

I’m still wandering through xine’s sources applying my own style to them, hoping that I can spot problematic areas before they become a problem. What I’ve seen is quite interesting, as the code is really suboptimal in some cases: some functions could be factored out so that they are shared between plugins, in some places we leak memory because we allocate an object and then return without freeing it, in others we could reduce the size of the code section protected by a mutex.

I haven’t committed and pushed these changes yet because I’m sure merging them into 1.2 will be a mess, so I’ll wait until I’m willing to spend the time to do that merge.

In general, what I’ve been following for xine’s source changes are a few simple rules:

  • whenever possible, use constants instead of variables;
  • whenever possible, declare constants and variables only when you can give them a meaningful value;
  • whenever possible, reduce the scope of variables and constants by adding blocks;
  • in each function, test first all the prerequisites, before any allocation or constant calculation;
  • in each function, when a condition has to be checked and no code has to be executed if it’s not true, return from the function rather than increasing the indentation level with a new if block;
  • if the same exact calculation has to be done on different values, create a function, even if it’s a single return or little more; constant functions will most likely be optimised by the compiler;
  • when a function requires using mutexes, make sure that allocating and freeing memory areas happens outside the mutex-protected code, even if that means keeping memory allocated a while longer than necessary.

Unfortunately xine, while already depending on a C99-compliant compiler – for FFmpeg – does not yet require enabling all the C99 features. Which means I can’t use C99-style for loops, which I find quite useful for keeping code clean.

My reasoning for these rules is actually quite simple.

Assigning a value to a variable when it’s declared makes GCC’s unused-variable warnings more useful (assignment is considered use, which means a variable that is declared and then only assigned is never considered unused – which, by the way, is wrong: GCC should warn in that case too, but I’m no good with compilers so I can’t patch it for this), and makes it less likely that a quick change uses the variable before a value is assigned.

Marking data constant whenever possible makes it less likely that the meaning of a variable is changed. I find it nicer to have ten constants than one variable with ten different meanings. It also often forces you to find more meaningful names for the data, instead of relying on variables called tmp, n, i, j, k…

Reducing the scope of variables combines the two objectives above: you cannot abuse a variable if its scope is limited, and you cannot use a variable before it gets a useful value if it’s not in your scope.

Checking conditions beforehand makes it possible to avoid allocations before the prerequisites are checked, which in turn avoids pointless allocation-and-freeing cycles. It also makes it possible to avoid locking and unlocking mutexes for no good reason.

Returning early from a function is instead useful to avoid multiple indentation levels when multiple prerequisites are checked in series. It only improves readability, but that’s still a lot.

Splitting a calculation out into a shared function makes copy-paste mistakes less likely, and makes the calculation easier to change when it needs fixes.

And mutex-protected code should be as limited as possible, as it’s code that doesn’t make good use of multicore systems, and in today’s computing world that’s bad.

I think I’ll complete this cleanup soon enough, and then move to 1.2 to change the paradigm a bit more. Mostly I’m tempted to see where we can replace some mutexes with R/W locks (which should make xine work better in parallel, as threads would block each other only while writing).

I’ve also started thinking of a few things we could try for 1.3 or 2.0, namely glib’s lists and thread support to replace xine’s own implementations. My reason is that it makes very little sense to reinvent the wheel for these things when well-developed implementations are already available.

Anyway, I sincerely hope to make some difference in the not-so-distant future, but I don’t count on it just yet. I would really like to see how my changes apply to multicore systems and non-x86 systems. Too bad Sun boxes cost so much, or I’d be glad to work with an 8-way UltraSPARC CPU (and maybe Solaris) and optimise xine for it. If somebody from Sun is reading, a developer discount program would be appreciated ;)