FFmpeg, AppleTV and conversions

Last year, a few months before ending up in the hospital, I bought an LCD TV and an AppleTV device to watch my anime before sleeping, relaxing in my bedroom. After the hospital I ended up not watching much anime anyway, but before that I noticed one nasty thing: my LCD TV “eats” around 16–20 pixels at the borders, with the result that the subtitles for Japanese anime were unreadable. Which was very annoying, in my opinion.

After the AppleTV Take 2 software was released I didn’t have much time to play with it to modify it again and install extra codecs, nor did I have time to watch much anime, even though I bought a PlayStation 3 to relax. Now, I also considered the idea of selling the AppleTV, getting a bigger hard drive for the PlayStation 3 and living with that. But since I have already spent enough time in the hospital, having mobile access to my anime is still a feature I’d like to keep, and since I can easily make the AppleTV and the iPod use the same video files, while the PS3 and the PSP (which I don’t own; my sister does, though) would require different copies, I finally decided to keep it.

I had already tried converting video files before: FFmpeg failed, while VisualHub let me do some of the conversions; unfortunately I never had the time to finish the job before I ended up in the hospital again. Geez, you really have some trouble when you can count eras based on when you ended up in the hospital. Anyway, since I now have a new box, I decided to try the conversion once again with FFmpeg, with all the improvements that have landed since then.

Before trying that, though, I wanted to take care of one thing: removing the data duplication between my workstation (Yamato) and the laptop (Intrepid). My easy way out of this is to make sure they use the same partition for music and video, which is what they share. Linux and OSX can share mainly two filesystems: FAT32 (nasty) and HFS+. The problem is that HFS+, Apple’s filesystem, comes in multiple variants: it may or may not be case-sensitive, and it may be journaled. Linux cannot write to journaled HFS+, and I dislike case insensitivity, so I set up an HFS+ case-sensitive, non-journaled partition beside the Time Machine partition on the external hard drive, and connected the drive to Yamato.

Moving the music and video files off Yamato’s drives also helped me reduce the amount of data I keep on the internal drives, which in turn should reduce, at least a little, the stress on them until I can get a new setup. And Linux’s HFS+ support is not that bad after all, once you disable journaling. The problem is that I wanted to make sure I could use the same disk both attached to Yamato and shared over the network, and directly connected to Intrepid. One tried way to do this would be exporting the partition via iSCSI, but that would have given the laptop exclusive access to it, and thus once again prevented me from sharing the data.

The other solution was to export the filesystem via NFS when connected to Yamato, which is what I tried. Unfortunately, as I’ve written before, Linux does not support NFS-exporting HFS+ partitions; or at least it doesn’t right now. I love Free Software for exactly this reason: if a feature does not exist, I can write it myself; now I just have to hope that it gets applied upstream in a decent timeframe.

Now that the filesystem problem is mostly behind me (mostly, because I still see some failures; I don’t know exactly what they relate to yet), I wanted to make sure I could convert my content with FFmpeg directly; Yamato is an 8-core system, so it should be faster than the laptop at doing the work. So I checked out the latest FFmpeg version from Subversion and tried.

The first issue is getting the right container format. While ISO Media, QuickTime, iPod video files and so on are all based on more or less the same idea, to the point that xine can easily use a single demuxer for all of them (as does libavformat), there are some tricky issues in getting those files to work with QuickTime, iTunes and the AppleTV. Luckily, FFmpeg has an ipod preset that mostly does the job, which is very good. The problem is the “mostly”: these formats are built out of elements called atoms, and atoms can have versions; the common version 0, which is what Apple’s software supports, has 32-bit values for timescales, offsets and the like, while there is also a version 1 that uses 64-bit values instead. xine supports both, but Apple’s software does not. The interesting bit here is that FFmpeg produces version 1 atoms whenever the time scale does not fit as-is into the 32-bit values, without warning, even when asking for the iPod container format. The trick here is to force a timescale conversion: most of the content I got is encoded at approximately 29.97 fps, but uses a timescale in the AVI files that does not fit into version 0 atoms; converting the timescale to basically the same rate through -r 29.97 fixes the problem for me.

Now, time to get the right video and audio codecs. If you have ever tried to do this you know you have to use H.264 for video and AAC for audio; I used 800kbit/s for video and 128kbit/s for audio; just make sure you pass the parameters before the output filename, otherwise they won’t be picked up. To enable the proper features of the x264 encoder, I used the presets that are in FFmpeg’s repository; I chose to skip the -max version, since that was giving me less than one frame per second, which was far too slow, and used hq instead, which gave me good enough results. The problem is that the default version enables 8×8 DCT and B-pyramid, which QuickTime does not properly support: files encoded with these two features enabled synced properly to the AppleTV but crashed it as soon as they were played. Not so nice. To fix this, just replace the + symbol with a - in front of the two features, dct8x8 and bpyramid, in the preset. Problem solved, and the files play.

For some reason, FFmpeg’s MS-MPEG4 decoder dies when enabling more than 8 threads, and even with that many threads only about four cores get used on Yamato; I’m not sure why the limit is imposed by the video decoder, since I would expect at least one thread to be able to handle audio. Anyway, since that’s the deal, I decided to convert two files at a time, and to do that, well, we have GNU make, don’t we?

I wrote this Makefile:

VPATH = $(ORIG)

SRCS = $(notdir $(wildcard $(ORIG)/*.avi $(ORIG)/*.mp4))
all: $(patsubst %.mp4,%.m4v,$(patsubst %.avi,%.m4v,$(SRCS)))

%.m4v: %.avi
	ffmpeg -threads 8 -i $< -vpre libx264-appletv-hq.ffpreset -b 800k -ab 128k -acodec libfaac -padcolor 000000 -padtop 16 -padbottom 16 -padleft 16 -padright 16 -r 29.97 -f ipod "$@"

%.m4v: %.mp4
	ffmpeg -threads 8 -i $< -vpre libx264-appletv-hq.ffpreset -b 800k -ab 128k -acodec libfaac -padcolor 000000 -padtop 16 -padbottom 16 -padleft 16 -padright 16 -r 29.97 -f ipod "$@"

As you can see, I’m padding the video with 16 pixels on each side so that the subtitles show up within my TV’s visible area; thanks to FFmpeg, the padding is very quick and does not degrade quality.

I just run make -f ~/makefile.convert -j2 ORIG=/directory and there it goes: the conversion is running, and it’ll be done in… a few hours.

Glib, byte swapping, and hidden documentation (and improvements)

So I wasted a whole evening because of a mistake of mine. Today I decided to benchmark the byteswapping code from glibc, glib and FFmpeg/libavutil, to make sure I was doing the right thing by porting lscube entirely to glib, rather than making it use libavutil. The other target I had in mind with this benchmarking was identifying whether glib needed improvements that could possibly be borrowed from FFmpeg (both glib and most of FFmpeg are LGPL, so it should work fine), and vice versa.

So with this in mind I wrote a very basic benchmarking rig, prepared everything, and started testing with 100 runs of byteswapping over a 1MB file. Astonishingly, glib outperformed both glibc and FFmpeg, and not by a few percent: it took half the time! Oh man, I was so glad. So I decided to get a bit more data on bigger files, to check that it wasn’t a flaw in the benchmark, and it kept outperforming them.

I started looking at what the difference could be in the code, and noticed that the only difference between the code generated for a simple test case by the glib macro and by the bswap_16 functions from glibc and FFmpeg (both generate exactly the same code) is the direction of the rotation: it rotates left rather than right. That was too trivial to be the real difference. I checked the code of the full benchmark and noticed there is also some instruction reordering, which GCC could do since there was no inline assembly involved in the glib macro: it’s just the standard shift-and-or macro. I thought that was it and started writing down some notes. And just then I noticed that a parameter of addq also changes: 4 rather than 2. I checked the code more carefully, and indeed the problem was in the macro itself.

As you may or may not notice at first glance from the relevant documentation, the glib byte swapping macros may evaluate their arguments twice, which rules out using them with expressions that have side effects, like a postfix increment (foo++), which, as you may have guessed by now, is exactly what I was doing. Okay, I’m a n00b; there’s no need to add anything to that. I wasted half a day chasing a ghost that of course couldn’t be there.
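
To make the pitfall concrete, here’s a minimal sketch (my own macro mirroring the plain shift-and-or pattern, not glib’s actual implementation) of how double evaluation breaks a postfix increment:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical macro mirroring the plain shift-and-or pattern:
 * the argument appears twice, so it gets evaluated twice. */
#define SWAP16(val) ((uint16_t) ((((uint16_t)(val) >> 8) & 0xff) | \
                                 (((uint16_t)(val) & 0xff) << 8)))

int main(void)
{
    uint16_t buf[4] = { 0x1122, 0x3344, 0x5566, 0x7788 };
    const uint16_t *p = buf;

    /* Intention: swap the first element and advance by one element.
     * Reality: p is incremented once per evaluation, so it advances by
     * two elements and the result mixes buf[0] and buf[1] (strictly
     * speaking this is even undefined behaviour). */
    uint16_t swapped = SWAP16(*p++);

    printf("swapped = %#x, p advanced by %d elements\n",
           swapped, (int)(p - buf));
    return 0;
}

An inline assembly or function-based swap evaluates the argument exactly once, which is why the generated loop advanced by 2 bytes there and by 4 with the macro.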

On the other hand, I would like to suggest that the documentation could certainly be improved there; what do I mean? Let me quote the page here:

Description

These macros provide a portable way to determine the host byte order and to convert values between different byte orders.

The byte order is the order in which bytes are stored to create larger data types such as the gint and glong values. The host byte order is the byte order used on the current machine.

Some processors store the most significant bytes (i.e. the bytes that hold the largest part of the value) first. These are known as big-endian processors.

Other processors (notably the x86 family) store the most significant byte last. These are known as little-endian processors.

Finally, to complicate matters, some other processors store the bytes in a rather curious order known as PDP-endian. For a 4-byte word, the 3rd most significant byte is stored first, then the 4th, then the 1st and finally the 2nd.

Obviously there is a problem when these different processors communicate with each other, for example over networks or by using binary file formats. This is where these macros come in. They are typically used to convert values into a byte order which has been agreed on for use when communicating between different processors. The Internet uses what is known as ‘network byte order’ as the standard byte order (which is in fact the big-endian byte order).

Note that the byte order conversion macros may evaluate their arguments multiple times, thus you should not use them with arguments which have side-effects.

Yes, this is the main description text block of the page. In case you didn’t notice, the place where the behaviour is documented is the last line; I’ll repeat it here to emphasise it: Note that the byte order conversion macros may evaluate their arguments multiple times, thus you should not use them with arguments which have side-effects.

So the behaviour is documented, and this is certainly not a bug. On the other hand I would like to point out a couple of things. The first is that not all the macros, and not on all architectures, share the same behaviour, which is, in my view, a trap waiting to explode; while it’s true that the byte order conversion macros have to differ on a per-architecture basis, it’s certainly not nice to have them behave differently with respect to argument evaluation. In particular, on x86 systems all the macros evaluate their argument once; if I had run my benchmark on x86, I would never have noticed, for instance.

The other nitpick is about the documentation I quoted: the line that warns you about this (to me unexpected) behaviour is the last line, without any emphasis, of a block that starts by explaining what endianness is and why byte swapping macros are needed. Sorry guys, but it’s more than likely that a programmer who has worked even just once with network code knows this pretty well, and will decide to skip over the whole section… including the warning at the end. I don’t know gtk-doc, but I’d expect that, like doxygen, it has a way to highlight that particular line: a warning or a note, or something, so that the developers’ attention is drawn to that particular, very important line.

I’ve learned much more today than just the way glib’s macros behave. I learnt that in the API documentation of anything, it’s critical that important non-obvious information is not buried under obvious information that might already be common knowledge, and that you should probably avoid mixing conceptual documentation with the API reference. Maybe this is a bit exaggerated, but a link to the Wikipedia entry on Endianness would have sufficed, in my opinion; it explains everything a newbie needs to know about endianness without burying the important information.

Okay, now enough with the documentation (I hope this criticism is considered constructive, by the way); let’s talk about the possible improvements. On x86 and AMD64 systems, glibc, glib and FFmpeg have more or less the same code (as I said, GCC is a commie and prefers rotating left rather than right; just kidding, if you didn’t catch that); on other architectures, things are more complicated.

I cannot say for certain about glibc, since I don’t have it on any other architecture at the moment (does anybody know of new discounts on the YDL PowerStation?); I can only guess it always tries to be the most optimised. The macros from glib 2.18 have special inline assembly code for i386, i486+, IA-64 and x86-64. FFmpeg’s libavutil bswap.h header (which might have been split out already by the time this post is published) has special code for x86 (with and without the bswap instruction), x86-64, SH4, ARMv6, ARMv4l and Blackfin.

Can you see the interesting point here? FFmpeg lacks the optimised IA-64 byteswap macros that glib has, while glib lacks the ones for many embedded systems, including ARM, which is probably one of the targets of the Maemo platform (and an architecture I happened to use a couple of times for work-related stuff in the past).

Maybe tomorrow I’ll check out if it’s possible to do some cross-pollination between the projects.

Supporting more than one compiler

As I’ve written before, I’ve been working on FFmpeg to make it build with the Sun Studio Express compiler, under Linux and then under Solaris. Quite sincerely, while supporting multiple (free) operating systems, even niche Unixes (as Lennart likes to call them), is one of the things I spend a lot of time on, I have little reason to support multiple compilers myself. FFmpeg, on the other hand, tends to support compilers like the Intel C Compiler (probably because it sometimes produces better code than the GNU compiler, especially when it comes to MMX/SSE code; on the other hand it lacks some basic optimisations), so I decided to make sure I don’t create regressions when I do my magic.

Right now I have five different compile trees for FFmpeg: three for Linux (GCC 4.3, ICC, Sun Studio Express) and two for Solaris (GCC 4.2 and Sun Studio Express). Unfortunately, the only two trees that build entirely correctly are GCC and ICC under Linux. GCC under Solaris still needs fixes that aren’t available upstream yet, while Sun Studio Express has some problems with libdl under Linux (I think the same applies to Solaris) and explodes entirely under Solaris.

While ICC still gives me some problems, Sun Studio is giving me the worst headache since I started this task.

While Sun seems to strive for GCC compatibility, there are quite a few bugs in their compiler, like -shared not really being the same as -G (although the help output states so). Up to now the funniest bug (or at least the most absurdly idiotic behaviour) has been the way the compiler handles libdl under Linux. If a program uses the dlopen() function, sunc99 decides it’s better to silently link it to libdl, so that the build succeeds (while both icc and gcc fail, since there is an undefined symbol); but if you’re building a shared object (a library) that also uses the function, that is not linked against libdl. It reminded me of FreeBSD’s handling of -pthread (it links the threading library into executables but not into shared objects), and I guess it is done for the same reason (multiple implementations, maybe, in the past). Unfortunately, since it’s done this way, configure detects that dlopen() requires no extra library, but then later on libavformat fails to build (if vhook or any of the codecs that load external libraries are enabled).
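
To see how the detection goes wrong, here’s a minimal sketch of a configure-style test (a hypothetical file of mine, not FFmpeg’s actual check):

/* dltest.c: hypothetical configure-style test case.
 * dlopen() normally requires linking with -ldl on Linux with glibc. */
#include <dlfcn.h>

void *open_plugin(const char *path)
{
    return dlopen(path, RTLD_NOW);
}

int main(void)
{
    return open_plugin("libnotthere.so") == NULL;
}

Compiled and linked as an executable, sunc99 accepts this without any -ldl, while gcc and icc stop with an undefined reference to dlopen; build the same code into a shared object (with -G) and the dlopen reference simply stays undefined, which is exactly what later bites libavformat.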

I thus reported those two problems to Sun, although there are a few more that, touching some grey areas (in particular C99 inline functions), I’m not sure whether to treat as Sun bugs or not. These include, for instance, the fact that static (C99) inline functions are emitted in object files even when unused (with their undefined symbols following them, causing quite a bit of a problem when linking).
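
For illustration, this is the kind of pattern that triggers the grey area (hypothetical names, not FFmpeg code):

/* header-like snippet, included by many translation units */
#include <math.h>

/* A static inline helper that calls into another library (libm here).
 * GCC simply discards it when unused; a compiler that emits it anyway
 * drags an undefined reference to log() into every object file that
 * merely included the header. */
static inline double scaled_log(double x)
{
    return log(x) * 0.5;
}

/* This translation unit never calls scaled_log(). */
int unrelated_function(void)
{
    return 42;
}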

The only thing I find non-GCC compilers useful for is taking a look at their warnings. While GCC is getting better at them, there are quite a few that are missing; both Sun Studio and ICC are much stricter about what they accept, and raise lots of warnings for things that GCC simply ignores (at least by default). For instance, ICC throws a lot of warnings about mixing enumerated types (enums) with other types (enumerated or integer), which gets quite interesting in some cases; in theory, I think the compiler should be able to optimise variables better if it knows they can only assume a reduced range of values. Also, the Sun Studio, ICC, Borland and Microsoft compilers all warn when there is unreachable code in the sources; recently I discovered that GCC, while supporting that warning, disables it by default both with -Wall and -Wextra, to avoid false positives with debug code.

Unfortunately, not even combining the three of them do I get the warnings I was used to with Borland’s compiler. It would be very nice if CodeGear decided to release a Unix-style compiler for Linux (their command-line bcc for Windows has a syntax that autotools don’t accept; one would have to write a wrapper to get those to work). They have already released free-as-in-soda compilers for Windows; it would be a nice addition to have a compiler based on Borland’s experience under Linux, even if it were proprietary.

On the other hand, I wonder if Sun will ever open the sources of Sun Studio; they have been opening so many things that it wouldn’t be unthinkable for them to open their compiler too. Even if they decided to go with the CDDL (which would make it incompatible with GCC’s licence), it could be a good way to learn more about the way they build their code (and it might be especially useful for UltraSPARC). I guess we’ll have to wait and see about that.

It’s also quite sad that there isn’t any alternative open source compiler focusing, for instance, on issuing warnings rather than on optimising stuff away (although it’s true that most warnings do come out of optimisation passes).

So, what am I doing with OpenSolaris?

I’ve written more than once in the past weeks about my messing with OpenSolaris, but I haven’t explained very well why I’m doing that, and what exactly it is that I’m doing.

So the first thing I have to say is that since I started getting involved in lscube I have focused on getting the build system into shape, so that the software could be built more easily, especially out of tree, which is what I usually do since I might have to try the build with multiple compilers (say hi to the Intel C Compiler and Sun Studio Express). But until now I had only tested it under Linux, which is quite a limitation.

While FreeBSD is tremendously reducing the gap it had with GNU/Linux (spelled out in full here, since I mean Linux and glibc together), OpenSolaris has quite a few differences from it, which makes it an ideal candidate for catching possible GNUisms creeping into the codebase. Having the Sun Studio compiler available also makes it much simpler to test with non-GCC compilers.

Since the OpenSolaris package manager sucks, I installed Gentoo Prefix and moved all the tools I needed, including GCC and binutils, into that. This made it much easier to deal with installing the libraries and tools needed for the projects, although some needed some tweaking too. Unfortunately there seems to be a bug with GNU ld from binutils, but I’ll have to check whether it’s also present in the default binutils version or whether it’s just Gentoo patching something wrong.

While using OpenSolaris I think I launched quite a few nasty Etruscan curses toward some Sun developers for some debatable choices, the first being, as I’ve already extensively written about, the package manager. But there have been quite a few other issues with libraries and include files, and with the compiler itself.

Since feng requires FFmpeg to build, I’ve also spent quite a lot of time trying to get FFmpeg to build on OpenSolaris, first with GCC, then with Sun Studio, then again with GCC and a PIC workaround: the bug I noted above with binutils is that GNU ld doesn’t seem to be able to create a shared object out of objects not compiled as PIC, so -fPIC has to be forced on for them to build; otherwise the undefined symbols for some functions, like htonl(), become absolute symbols with value 0, which causes obvious linking errors.

Since I’ve been using OpenSolaris from a VirtualBox virtual machine (which is quite slow, even though I’m using it without Gnome, logging in via SSH and jumping right into the Gentoo Prefix installation), I ended up first trying to build FFmpeg with the Sun Studio compiler taken from Donnie’s overlay under Linux, with Yamato building with 16 parallel processes. The problem here is that the Sun Studio compiler is quite a moving target, to the point that a Sun employee, Roman Shaposhnik, suggested on ffmpeg-devel that I try Sun Studio Express (which is, after all, what OpenSolaris has too), which should be more similar to GCC than the old Sun Studio 10 was. This is why dev-lang/sunstudioexpress is in Portage, if you hadn’t guessed it earlier.

Unfortunately, even with the latest version of the Sun Studio compiler, building FFmpeg has been quite some trouble. I ended up fighting quite a bit with the configure script, and not only with that; luckily, most of the patches I have written have now been sent to ffmpeg-devel (some of them accepted, others I’ll have to rewrite or improve depending on what the FFmpeg developers think of them). The amount of work needed just to get one dependency working is probably eating up the advantage I gained by using Gentoo Prefix for the dependencies that work out of the box on OpenSolaris.

(I’ll probably write about the FFmpeg changes more extensively, as they deserve a blog entry of their own; and actually the drafts for the blog entries I have to write are starting to pile up just as much as the entries in my TODO list.)

While using OpenSolaris I also started understanding why so many people hate Solaris this much; a lot of commands spit out errors that don’t let you know at all what the real problem is (for instance, if I try to mount an NFS filesystem with a simple mount nfs://yamato.local/var/portage /imports/portage, I get an error telling me that the NFS path is invalid; the actual problem is that I need to add -o vers=2 to request NFSv2, and why it doesn’t seem to work with v3 is something I didn’t want to investigate just yet). Also, the OpenSolaris version I’m using, although described as a “Developer Edition”, lacks a lot of man pages for the library functions (though I admit that most of those which are present are very clear).

In addition to the porting I’ve written about, I’ve also taken the time to extend the testsuite of my ruby-elf, so that Solaris ELF files are better supported; it is interesting to note that the elf.h file from OpenSolaris contains quite a few more definitions in that regard. I haven’t yet looked at the man pages to see whether Sun provides any description of the Sun-specific sections, for which I’d also like to add further parser classes. It has been interesting, since neither the GNU nor the Sun linker sets the ELF ABI field to values other than SysV (even though both Linux and Sun/Solaris have an ABI value defined in the standard), and GNU and Sun sections have some overlapping values (like the sections used for symbol versioning: glibc and Solaris handle those differently, but the section type ID used for both is the same; the only way to tell the two apart is the section name).
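
As an illustration of the overlap (my own sketch, not ruby-elf code, and the Sun section name below is just an example string): the versioning section types sit in the same range for both toolchains, so only the name can tell them apart.

/* Both toolchains use the same SHT_* values (0x6ffffffd to 0x6fffffff)
 * for symbol versioning, so the section name is the only discriminator. */
#include <elf.h>
#include <stdio.h>
#include <string.h>

static const char *classify_versioning(Elf64_Word sh_type, const char *name)
{
    if (sh_type < SHT_GNU_verdef || sh_type > SHT_GNU_versym)
        return "not a versioning section";

    /* GNU names start with ".gnu.version"; anything else carrying
     * these type values is assumed to be the Sun flavour. */
    if (strncmp(name, ".gnu.version", 12) == 0)
        return "GNU symbol versioning";
    return "Sun symbol versioning";
}

int main(void)
{
    printf("%s\n", classify_versioning(SHT_GNU_verdef, ".gnu.version_d"));
    /* ".SUNW_version" here is only an illustrative name. */
    printf("%s\n", classify_versioning(SHT_GNU_verdef, ".SUNW_version"));
    return 0;
}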

In the end, to resolve the problem, I modified ruby-elf to load sections on request rather than all at once, so that by the time most sections are loaded, the string table containing the section names is available. This makes it possible to know the name of a section, and thus to tell the extended sections apart by name rather than by ABI. Regression tests have been added so that the sections are loaded properly for the different ELF types too. Unfortunately I haven’t been able to produce a static executable on Solaris with either the Sun Studio compiler or GCC, so the only tests for Solaris ELF executables are for dynamic executables. Nonetheless, the testsuite for ruby-elf (which is the only part of it taking up space: out of 3.0MB occupied by ruby-elf, 2.8MB are for the tests) has reached 72 different tests and 500 assertions!

I bought a software license

I finally decided to convert my video library to MP4 containers with H.264 video and AAC audio, rather than the mix and match I had before. This is because I hardly use Enterprise to watch video anymore; not only because my office is tremendously hot during the summer, but more because I have a 32” TV set in my bedroom. Nicer to use.

Attached to that TV set there are an Apple TV (running unmodified software 2.0.2 at the moment) and a PS3. If you add all the other video-capable hardware I own, the only common denominator is H.264/AAC in an MP4 container. (I’ve also said before that I like the MP4 format more than AVI or Ogg.) It might be because I own a few Apple products (iPod and AppleTV), but Linux also handles this format pretty well, so I don’t feel bad about the choice. Besides, the new content I get from YouTube (like videos from Beppe Grillo’s blog) is also in this format; you can get it with youtube-dl -b.

Unfortunately, as I discussed before with Joshua, and as I already tried last year before the hospital, converting video to this format on Linux is a bit of a mess. While mencoder produces very good results for the audio/video stream conversion, producing a good MP4 container is a big issue. I tried fixing a few corner cases in FFmpeg before, but it’s a real mess to produce a file that QuickTime (and thus iTunes, and thus the Apple TV) will accept.

After spending a couple more days on the issue I decided my time is worth more than this, and finally gave in and bought a tool I had been told does the job: VisualHub for OSX. It cost less than €20, which is roughly what I’m usually paid per hour for my boring jobs.

I got the software and tried it out; the result was nice: video and audio quality on par with mencoder’s, but with a properly working MP4 container that QuickTime, iTunes, AppleTV, iPod and, even more importantly, xine can play nicely. But the log showed a reference to “libavutil”, which is FFmpeg. Did I just pay for Free Software?

I looked at the bundle: it includes a COPYING.txt file which is, as you might have already suspected, the text of GPL version 2. Okay, so there is Free Software in here indeed. And I can see a lot of well-known command line utilities: lsdvd, mkisofs, and so on. One nice thing to see, though, is an FFmpeg SVN diff. A little hidden, but it’s there. Good.

The question then was whether they were hiding the stuff, or whether it was disclosed and I had just missed it. Plus, they have to provide the sources of everything, not just a diff of FFmpeg. And indeed, on the last page of the documentation provided there is a link to this, which contains all the sources of the Free Software used; quite a lot, actually. They didn’t limit themselves to taking the software as it is, though: I see at least some patches to taglib that I’d very much like to take a look at later. I’m not sharing confidential registered-users-only information, by the way: the documentation is present in the downloadable package that acts as a demo too.

I thought about this a bit. They took a lot of Free Software, adapted it, wrote a frontend and sold licenses for it. Do I have a problem with this? My conclusion is that I don’t. While I would have preferred that they made it clearer on the webpage that they are selling a Free Software-based package, and that they had made the frontend Free Software too, I don’t think they are doing anything evil with this. They are playing by the rules, and they are providing working software.

They are not trying to exploit Free Software without giving anything back (the sources are there), and they did more than just package Free Software together: they tested and prepared encoding presets for various targets, including the Apple TV, which is my main target. They are, to an extent, selling a service (their testing and preset choices), and their license is also quite acceptable to me (it’s like a family license, usable on all the household’s computers as well as on a work computer in an office, if any).

At the end of the day, I’m happy to spend this money, as I suppose it will also go towards further developing the Free Software parts, although I would have been happier to chip in a bit more if it were fully Free Software.

And most importantly, it worked out of the tarball, solving a problem I had been having for more than a year now. Which means, for me, a lot less time spent trying to get the whole thing working. Of course, if one day I can do everything simply with FFmpeg I’ll be very happy, and I’ll dedicate myself a bit more to MP4 container support, both writing and parsing, in the future; but at least now I can just feed it the stuff I need converted and dedicate my time and energy to more useful goals (for me, as in paid jobs, and for the users, with Gentoo).

xine’s new QuickTime demuxer

Today I started some maintenance work on xine-lib, trying to get rid of a function I had wanted to get rid of for quite some time (the xine malloc wrapper). One issue after another, I ended up working on the QuickTime demuxer. It’s quite a huge demuxer, and widely used nowadays, as it’s the one used for MPEG-4 files (like M4A).

The big complex part of the demuxer is the parser for the moov atom, which contains all the important information about the file.

Right now I have a huge rewrite of the basic parser function: instead of going on and on with “if” conditions, it contains a series of tables which are used by a single parsing function to execute callbacks when the atoms are found.
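
Just to give an idea of the approach (a generic sketch, not xine’s actual code): each known child atom maps to a callback, and a single routine walks a container’s children and dispatches through the table.

/* A generic sketch of a table-driven atom parser. */
#include <stdint.h>

typedef struct qt_info qt_info;
typedef void (*atom_cb)(qt_info *info, const uint8_t *data, uint32_t size);

typedef struct {
    uint32_t fourcc;   /* e.g. 'mvhd' packed big-endian */
    atom_cb  handler;
} atom_entry;

static void parse_mvhd(qt_info *info, const uint8_t *d, uint32_t s) { (void)info; (void)d; (void)s; }
static void parse_trak(qt_info *info, const uint8_t *d, uint32_t s) { (void)info; (void)d; (void)s; }

static const atom_entry moov_children[] = {
    { 0x6d766864 /* mvhd */, parse_mvhd },
    { 0x7472616b /* trak */, parse_trak },
};

/* One generic routine replaces the long if/else chain: it reads each
 * child's size and fourcc and dispatches through the table. */
static void parse_children(qt_info *info, const uint8_t *data, uint32_t size,
                           const atom_entry *table, unsigned entries)
{
    uint32_t offset = 0;
    while (offset + 8 <= size) {
        uint32_t atom_size = ((uint32_t)data[offset]   << 24) |
                             ((uint32_t)data[offset+1] << 16) |
                             ((uint32_t)data[offset+2] <<  8) |
                              (uint32_t)data[offset+3];
        uint32_t fourcc    = ((uint32_t)data[offset+4] << 24) |
                             ((uint32_t)data[offset+5] << 16) |
                             ((uint32_t)data[offset+6] <<  8) |
                              (uint32_t)data[offset+7];
        if (atom_size < 8 || atom_size > size - offset)
            break;
        for (unsigned i = 0; i < entries; i++)
            if (table[i].fourcc == fourcc)
                table[i].handler(info, data + offset + 8, atom_size - 8);
        offset += atom_size;
    }
}

The moov handler then boils down to a single call like parse_children(info, moov_payload, moov_size, moov_children, 2), and supporting a new atom becomes a one-line table entry.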

The new code is bigger, as the functions are never inlined (their pointers are taken), and it might be slower because it looks the tables up at runtime, but it’s certainly more readable than the previous code, and easier to maintain, which is quite an important factor considering that there are only two xine developers out there: me and Darren.

I’m not even sure if it actually is slower, because the old code was quite complex, and I wouldn’t be surprised if at the end of the day my code ends up being on par with the original.

While on one hand I like doing this work, I really need to find the motivation to restart working on the 1.2 series. This huge rewrite will probably land in 1.1, for a series of reasons; it doesn’t require changes to the API, so it’s fine that way. But it would be quite a bit simpler if I had some better support from xine, like decent list handling and the like.

Tomorrow I’ll take a day off and play; hopefully it will recharge me. Today I spent almost all day, from 10 in the morning to 1 am the next day, on xine-lib’s code; that’s more than 12 hours spent coding, especially on something like xine-lib which… well, isn’t exactly the nicest hobby one might have.

The slowness of xine’s build doesn’t help either. Even with ccache, it forces me to relink every time I run make install, even just for plugins. I should see whether removing freetype’s libtool archive would solve it, as that’s what it keeps warning me about.

And I forgot to say that there seems to be a problem with the latest FFmpeg SVN, which I haven’t looked into yet. Sigh!

Bribes are welcome, although at the moment I’d be happier to know whether I have a new contract this summer, so I can afford a new box. Thanks to “Joe User” I now know where to find what I need, and it doesn’t even cost too much (just €1600 for an 8-core); I just need a couple more contracts…

Summer of Code, Gentoo and other projects

So it seems we got accepted as an organisation for Google Summer of Code 2008! And so were FFmpeg and FreeBSD (at least, I read about those on a few blogs).

I wish to remind the users who are interested in Gentoo/FreeBSD that the main way to improve Gentoo/FreeBSD (until I find the time to pick the project back up and make FreeBSD 7.0 ebuilds) is to improve FreeBSD itself. So, if you want, you can apply for their SoC and still help the Gentoo project!

And FFmpeg is also a project where I’d like to see new people working: xine is based on it, so it’s a very important project for me. If you feel you can actually work in that area, join their SoC!

But of course make sure to check out Gentoo ideas, and feel free to contact me if you want further information on the project I proposed myself.

And if you still don’t know where to apply, check out the ideas for the rest of Summer of Code 2008!

Warning! Current FFmpeg Subversion will likely break all your builds!

Don’t worry if I didn’t blog for a few days ;) I’m currently mostly swamped with work for my job; I’m using CodeGear’s C++ Builder now, rather than C#, but it still makes me wish I had remained working on Linux! On the other hand, I’ve added FIMP to my sidebar as an option: it’s another foundation for pancreatic diseases, but this time an Italian one, and it involves one of the doctors who took care of me (even if indirectly) during my hospitalisation. For those working in Italy, it’s also possible to donate the 5‰ of IRPEF, so that would also be transparent.

This post is just a warning to all the people using FFmpeg in their projects. The current Subversion copy of FFmpeg finally installs its headers in subdirectories, so for instance rather than having avutil.h you have libavutil/avutil.h. This is very useful to avoid ambiguity with libavutil/crc32.h, but it requires all the software using it to be updated, as the change is not transparent.

I’m having a hard time trying to get this to work properly with xine-lib, so that both 1.1 and 1.2 build against both old and new FFmpeg installs.
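
One way to cope with both layouts, sketched here with a hypothetical configure-defined macro (not necessarily what xine-lib will end up doing):

/* The build system would define HAVE_FFMPEG_SUBDIR_HEADERS when a
 * configure check finds the new subdirectory layout. */
#ifdef HAVE_FFMPEG_SUBDIR_HEADERS
#  include <libavcodec/avcodec.h>
#  include <libavutil/avutil.h>
#else
#  include <avcodec.h>
#  include <avutil.h>
#endif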

This also works as a warning for Luca, so that he will remember not to unleash a snapshot taken from these versions of FFmpeg on the tree without a full tree rebuild ;)

Introducing cowstats

No, it’s not a script to gather statistics about Larry; it’s a tool to get statistics about copy-on-write pages.

I’ve been writing for quite a while about memory usage, RSS memory and other stuff like that on my blog, so if you want some more in-depth information about it, please just look around. If I started linking here all the posts I’ve made on the topic (okay, the last one is not a blog post ;) ) I would probably spend the best part of the night digging them up (I only linked here the most recent ones on the topic).

Trying to summarise for those who haven’t been reading my blog all this time: let’s start by saying that a lot of software, even free software, nowadays wastes memory. When I say waste, I mean it uses memory without a good reason to. I’m not saying that software using lots of memory to cache or precalculate things, and thus be faster, is wasting memory; that’s just using memory. I’m not referring to memory leaks either, which are usually just bugs in the code. I’m saying that a lot of software wastes memory when it could save memory without losing performance.

The memory I consider wasted is memory that could be shared between processes but isn’t. That’s a waste because you end up using twice the memory, or more, for the same goal, which is way sub-optimal. Ben Maurer (a GNOME contributor) wrote a nice script (which is in my overlay if you want it; I should finish fixing a couple of things in the ebuild and commit it to the main tree already, since the deps are already there) that tells you, for a given process, how much memory is not shared with other processes, the so-called “dirty RSS” (RSS stands for Resident Set Size; it’s the resident memory, that is, the memory the process is actually using from your RAM).

Dirty RSS is caused by “copy-on-write” pages. What is a page, and what is a copy-on-write page? Well, memory pages are the unit used to allocate memory to processes (and to threads, and kernel subsystems, but let’s not go too deep there); when a process is given a page, it usually also gets some permissions on it: it might be readable, writable or executable. Trying not to get too deep into this either (I could easily write a book on it; maybe I should, actually), the important thing is that read-only pages can easily be shared between processes, and can be mapped directly from a file on disk. This means that two processes can both use the same 4KB read-only page, using just 4KB of memory, while if the same content were in a writable page, the two processes would each have their own copy of it and would require 8KB of memory. Maybe more importantly, if the page is mapped directly from a file on disk, when the kernel needs to make space for newly allocated memory it can just drop the page and later re-load it from the original file, rather than writing it out to the swap file and loading it back from there.

To make it easier to load the data from the files on disk, and to reduce memory usage, modern operating systems use copy-on-write. The pages are shared as long as they are not changed from the original; when a process tries to change the content of a page, it’s copied into a new writable page, and the process gets exclusive access to it, “eating” the memory. This is the reason why using PIC shared objects usually saves memory, but that’s another story entirely.
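
If you want to see copy-on-write in action, here’s a small stand-alone toy example of mine using fork(): the child’s dirty RSS only grows once it starts writing to the inherited buffer, which you can watch in /proc/<pid>/smaps while it sleeps.

#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SIZE (4 * 1024 * 1024)

int main(void)
{
    char *buf = malloc(SIZE);
    memset(buf, 0xaa, SIZE);          /* the parent dirties the pages once */

    pid_t pid = fork();
    if (pid == 0) {
        /* Still shared with the parent: reading costs no extra memory. */
        volatile char sum = 0;
        for (size_t i = 0; i < SIZE; i += 4096)
            sum += buf[i];

        /* Writing triggers copy-on-write, one page at a time. */
        for (size_t i = 0; i < SIZE; i += 4096)
            buf[i] = 0x55;

        sleep(30);                    /* time to inspect smaps */
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    free(buf);
    return 0;
}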

So we should reduce the number of copy-on-write pages, favouring read-only shareable pages instead. Great, but how? Well, the common way to do so is to make sure you mark (in C) all the constants as constant, rather than defining them as variables you never change. Even better, mark them static and constant.
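
A minimal before-and-after sketch (hypothetical table, not from any real project); objdump -t will show the first array in .data and the second in .rodata:

/* The first array ends up in .data: the page is writable, so it sits
 * in a private copy-on-write mapping and turns into a per-process
 * dirty page as soon as anything on it is written.  The second ends
 * up in .rodata: read-only, mapped straight from the file on disk and
 * shareable between all processes using the library. */
static int limits_writable[4]     = { 10, 100, 1000, 10000 };
static const int limits_shared[4] = { 10, 100, 1000, 10000 };

int lookup_limit(unsigned idx)
{
    /* reference both arrays so the compiler keeps them around */
    return limits_writable[idx & 3] + limits_shared[idx & 3];
}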

But it’s not so easy to go through the whole codebase of a long-developed piece of software marking everything constant, so there’s a need to analyse the software after the fact and identify what should be worked on. Up to now I used objdump (from binutils) to do so; it’s a nice tool for getting raw information about ELF files. It’s not easy to use, but I grew used to it, so I can easily grok its output.

Focusing on ELF files, which are the executable and library files in Linux, FreeBSD and Solaris (plus other Unixes), the copy-on-write pages are those belonging, mostly, to these sections: .data, .data.rel and .bss (actually, there are more sections, like .data.local and .data.rel.ro, but let’s just consider those prefixes for now).

The .data section keeps the non-stack variables (which means anything declared as static but non-constant in C source) that were initialised in the source. This is probably the cause of most of the wasted memory: you define a static array in C, you don’t mark it constant properly (see this for string arrays), and yet you never touch it after definition.

The .data.rel section keeps the non-stack variables that need to be relocated at runtime. For instance it might be a static structure containing a string, or a pointer to another structure or an array. Often you can’t get rid of relocations, but they have a cost in terms of CPU time, and also a cost in memory usage, as the relocation is guaranteed to trigger the copy-on-write… unless you use prelink, but as you’ll read at that link, it’s not always a complete solution. You can usually live with these, but if you can get rid of some instances here, it’s a good thing.
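
The string-array case deserves its own sketch (hypothetical names): the first form is both writable and relocated, the second is const but still relocation-dirty when built as PIC, and the third avoids pointers entirely.

/* The first array is writable and needs relocations, so it lands in
 * .data (or .data.rel); the second is const but still made of
 * pointers, so when built as PIC it lands in .data.rel.ro and the
 * relocations dirty it anyway; the third holds no pointers at all and
 * goes straight to .rodata. */
static char *codec_names_bad[]                = { "mpeg4", "h264", "aac" };
static const char *const codec_names_better[] = { "mpeg4", "h264", "aac" };
static const char codec_names_best[][8]       = { "mpeg4", "h264", "aac" };

const char *codec_name(unsigned i)
{
    /* reference all three so the compiler keeps them around */
    return i == 0 ? codec_names_bad[0]
         : i == 1 ? codec_names_better[1]
         :          codec_names_best[2];
}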

The .bss section keeps the uninitialised non-stack variables: for instance, if you declare and define a static array but don’t fill it right away, it will be added to the .bss section. That section is mapped onto the zero page (a page entirely initialised to zero, as the name suggests) with copy-on-write: as soon as you write to the variable, a new page is allocated, and thus memory is used. Usually, runtime-initialised tables fall into this section. It’s often possible to replace them (perhaps optionally) with precalculated tables, saving memory at runtime.
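
Here’s the typical pattern, sketched with a hypothetical bit-reversal table rather than any real libvorbis or FFmpeg table:

#include <stdint.h>

/* Variant 1: filled at runtime.  The array lives in .bss, and every
 * process that calls the init function dirties its own copy. */
static uint8_t reverse_bits_runtime[256];

static void init_reverse_bits(void)
{
    for (unsigned i = 0; i < 256; i++) {
        uint8_t r = 0;
        for (unsigned b = 0; b < 8; b++)
            if (i & (1u << b))
                r |= (uint8_t)(1u << (7 - b));
        reverse_bits_runtime[i] = r;
    }
}

/* Variant 2: the same data precalculated and marked const.  It lives
 * in .rodata and is shared read-only between processes (only the
 * first few entries are shown here). */
static const uint8_t reverse_bits_const[256] = {
    0x00, 0x80, 0x40, 0xc0, 0x20, 0xa0, 0x60, 0xe0,
    /* ... remaining entries ... */
};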

My cowstats script analyses a series of object files (tomorrow I’ll work on an ar parser so that it can be run on static libraries; unfortunately it’s not possible to run it on executables or shared libraries, as they tend to hide the static symbols, which are the main cause of wasted memory), looks for the symbols present in those sections, and lists them to you; alternatively, it shows you some statistics (a simple table telling you how many bytes are used in the three sections for each of the object files it was called with). This way you can easily see which variables cause copy-on-write pages to be requested, so that you can try to change them (or the code) to avoid wasting memory.

I wrote this script because Mike asked me if I had an automated way to identify which variables to work on, after a long series of patches (many of which I have to fix and re-submit) for FFmpeg to reduce its memory usage. It’s now available at https://www.flameeyes.eu/p/ruby-elf, as it’s simply a Ruby script using the ELF parser for Ruby I started last May. It’s nice to see that something I did some time ago for a completely different reason comes in useful again ;)

I mailed the results for my current partly-patched libavcodec; they are quite scary: over 1MB of copy-on-write pages. I’ll keep working so that the numbers come closer to zero. Tomorrow I’ll also try to run the script on xine-lib’s objects, as well as xine-ui’s. It should be interesting.

Just as a test, I also tried running the script over libvorbis.a (extracting the files manually, as I currently have no way to access those archives through Ruby), and here are the results:

cowstats.rb: lookup.o: no .symtab section found
File name  | .data size | .bss size  | .data.rel.* size
psy.o             22848            0            0
window.o          32640            0            0
floor1.o              0            8            0
analysis.o            4            0            0
registry.o           48            0            0
Totals:
    55540 bytes of writable variables.
    8 bytes of non-initialised variables.
    0 bytes of variables needing runtime relocation.
  Total 55548 bytes of variables in copy-on-write sections

(The warning tells me that the lookup.o file has no symbols defined at all; the reason for this is that the file is under one big #ifdef; the binutils tools might be improved to avoid packing those files at all, as they can’t be used for anything, bearing no symbol… although it might be that they still can carry .init sections, I admit my ignorance here).

Now, considering the focus of libvorbis (only Vorbis decoding), it’s scary to see almost 55KB of memory in writable pages; especially since, looking into it, I found that they are due to a few tables which are never modified but are not marked as constant.

The encoding library libvorbisenc is even worse:

File name   | .data size | .bss size  | .data.rel.* size
vorbisenc.o      1720896            0            0
Totals:
    1720896 bytes of writable variables.
    0 bytes of non-initialised variables.
    0 bytes of variables needing runtime relocation.
  Total 1720896 bytes of variables in copy-on-write sections

Yes, that’s about 1.7MB of writable pages brought in by libvorbisenc for every process which uses it. And I’m sorry to tell you that any xine frontend (Amarok included) might load libvorbisenc, as libavcodec has a Vorbis encoder which uses libvorbisenc. Not nice at all!

Tomorrow I’ll try to prepare a patch for libvorbis (at least) and see whether Xiph won’t ignore me this time. Once the script is able to act on static libraries, I might just run it on all the ones on my system and identify the ones that really need to be worked on. This of course must not get in the way of my current jobs (I’m considering this in-depth look at memory usage part of my job, as I’m probably going to need it in a course I have to teach next month), as I really need money, especially to get a newer box before the end of the year; Enterprise is getting slow.

Mike, I hope you’re reading this blog; I tried to explain what I’ve been doing in the best way possible :)

Some more details about tables

I have a cold; I’ve been lying in bed because of it for four days, and I’m bored. Anyway, I sent a few more patches to ffmpeg-devel today, although the fever did increase my mistake ratio.

Anyway, continuing on the subject of tables, I’ve also mailed ffmpeg-devel a simple command to measure the amount of .bss table data present in libavcodec:

{ objdump -t libavcodec/*.o | fgrep .bss |
    awk '{ print $5 }' | sed -e 's:^:0x:' \
        -e 's:$: + :'; echo 0; } | irb

The result was surprising: 961KiB (plus the tables that I had already hardcoded locally and sent upstream). This means that every process using libavcodec can use up to that amount of memory in COW’d tables; a fully initialised libavcodec will use all of it.

Replacing all the runtime-initialised tables with hardcoded ones will increase the on-disk size of the libavcodec library by roughly that amount, but in turn there will be no COW. This means faster startup for the library, as no copy-on-write is triggered during initialisation, and less memory used.

Tomorrow, if I feel better, I’ll be adding table-generation code to FFmpeg, so that maintaining the hardcoded tables is less annoying. I sincerely hope it can be integrated, so that Amarok will use less memory ;)
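
The idea is roughly this kind of build-time generator (a sketch of mine, not FFmpeg’s actual code), which prints the table as a const array ready to be included:

/* tablegen.c: prints a bit-reversal table as a C header with a const
 * array, so the runtime initialisation can go away entirely. */
#include <stdio.h>

int main(void)
{
    printf("static const unsigned char reverse_bits_table[256] = {\n");
    for (int i = 0; i < 256; i++) {
        int r = 0;
        for (int b = 0; b < 8; b++)
            if (i & (1 << b))
                r |= 1 << (7 - b);
        printf("%s0x%02x,%s", (i % 8) ? " " : "    ", r,
               (i % 8 == 7) ? "\n" : "");
    }
    printf("};\n");
    return 0;
}

Run at build time, its output gets included instead of the .bss table, trading a little on-disk size for pages that stay clean and shared.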