So I wasted a whole evening because of a mistake of mine. Today I decided to benchmark out the byteswapping code from glibc, glib and FFmpeg/libavutil, to ensure that I was doing the right thing by porting lscube to glib entirely, rather than making it use libavutil. The other target I had in mind with this benchmarking was identifying whether glib needed improvements, that could possibly be found on FFmpeg (both glib and most of FFmpeg are LGPL so it should work fine), and vice-versa.
So with this in mind I written a very basic benchmarking rig and prepared everything, and started testing a 100-run on a 1MB file of byteswapping. Astonishingly, glib outperformed both glibc and FFmpeg, and not of percentages, it took half the time! Oh man, I was so glad. So I decided to get a bit more data on bigger files to check that it isn’t a flaw in the benchmark, and it still keeps outperforming them.
I start looking down at what the difference could be in the code, and I notice that the only difference between the code generated on a simple testcase by the glib macro and the
bswap_16 functions from glibc and FFmpeg (both generate the same exact code), is in the sense of the rotation: it rotates left rather than right. It was too stupid for that to be the difference. I check the code of the full benchmark and I notice that there is also some instruction reordering, that GCC could do since there was no inline assembly involved in the glib macro: it’s just the standard shift and bitwise-or macro. I thought that was it and started writing down some notes. And just then I notice that it also changes a parameter of
addq, 4 rather than 2. I check the code better and indeed the problem is in the macro itself.
As you may or may not notice at first glance from the relative documentation the glib byte swapping macros might evaluate their arguments twice, which disallows from using them with expressions having side-effects, like a postfix increase (
foo++) which, as you may guess by now, is exactly what I’ve been doing. Okay, I’m a n00b; there’s no need to add for that, I wasted half a day chasing a ghost that of course couldn’t be there.
On the other hand, I would like to suggest that the documentation could certainly be improved there; what do I mean? Let me quote the page here:
These macros provide a portable way to determine the host byte order and to convert values between different byte orders.
The byte order is the order in which bytes are stored to create larger data types such as the gint and glong values. The host byte order is the byte order used on the current machine.
Some processors store the most significant bytes (i.e. the bytes that hold the largest part of the value) first. These are known as big-endian processors.
Other processors (notably the x86 family) store the most significant byte last. These are known as little-endian processors.
Finally, to complicate matters, some other processors store the bytes in a rather curious order known as PDP-endian. For a 4-byte word, the 3rd most significant byte is stored first, then the 4th, then the 1st and finally the 2nd.
Obviously there is a problem when these different processors communicate with each other, for example over networks or by using binary file formats. This is where these macros come in. They are typically used to convert values into a byte order which has been agreed on for use when communicating between different processors. The Internet uses what is known as ‘network byte order’ as the standard byte order (which is in fact the big-endian byte order).
Note that the byte order conversion macros may evaluate their arguments multiple times, thus you should not use them with arguments which have side-effects.
Yes this is the main description text block of the page, if you didn’t notice it, the place where the behaviour is documented is the last line, I’ll repeat it here to emphasize it: Note that the byte order conversion macros may evaluate their arguments multiple times, thus you should not use them with arguments which have side-effects..
So the behaviour is documented and certainly this is not a bug; on the other hand I would like to point out a couple of things here, the first is that not all the macros and not on all architectures share the same behaviour, which is, in my view, a trap going to explode; whilst it’s true that the order conversion macros have to be different on a per-architecture basis, it’s certainly not nice to have them behave differently with respect to arguments evaluation. In particular on x86 systems, all the macros do one evaluation; if I were to test my benchmark on x86 I would have never noticed, for instance.
The other nitpicking is on the documentation I quoted: the line that warns you about this (unexpected to me) behaviour is the last line, without any emphasis on it, of a block that starts explaining what endianness is and why there’s need for byte swapping macros. Sorry guys but it’s more than likely that a programmer that has worked even just once with network code knows this pretty well, and would decide to skip over the whole section.. included the warning at the end. I don’t know gtk-doc, but I’d expect that, like doxygen, it has a way to put in evidence that particular line: an
warning ornote, or something, so that the attention of the developers would be drawn to that particular, very important line.
I’ve learned much more than just the way glib’s macros behave, today, I learnt that it’s critical that in the API documentation of anything, the important non-obvious information have not to be covered up by the obvious information that might be public knowledge already, and that you should probably try to avoid mixing logical documentation with API reference. Maybe this is a bit exaggerated, but a link to Wikipedia entry on Endianness would have sufficed, in my opinion; it explains all that a newbie needs to know about endianness without covering up important information.
Okay, now enough with the documentation (I hope this criticism is considered constructive, by the way), and let’s talk about the possible improvements. On x86, and AMD64 systems, glibc, glib and FFmpeg have the same code, more or less (as I said, GCC is a commie and prefers rotating left rather than right — just kidding, if you didn’t catch that); on other architectures, the thing is more complicated.
I cannot say for certain of glibc, since I don’t have glibc on any other architecture at the moment (anybody knows of new discounts on the YDL PowerStation ?), I can only guess it always tries to be the most optimised. Macros from glib 2.18 have special inline assembly code for i386, i486+, IA-64 and x86-64. FFmpeg’s libavutil
bswap.h header (which might have been split already by the time this post is published) have special code for x86 (with and without
bswap instruction), x86-64, SH4, ARMv6, ARMv4l and BlackFin.
Can you see the interesting point here? FFmpeg lacks support for IA-64 optimised byteswap macros that glib has, while glib lacks the one for many embedded systems, included ARM that is probably one of the targets of the maemo platform (and one architecture which I happened to use a couple of time for work-related stuff in the past).
Maybe tomorrow I’ll check out if it’s possible to do some cross-pollination between the projects.
Once again, the reason why macro’s should be avoided when possible. Note also, that inline functions would do this just as well, and optimize just as well too.
Oh I wholeheartedly agree with you here.FFmpeg’s inline functions generated basically the same code that glib macros did, on GCC. I’m afraid that the problems here are with other compilers though, I know the Sun Studio Express compiler does not clear out inline functions so easily so yes it might be quite a problem there.