
Fighting with the glib slice allocator

I wrote the other day about how I think glib is good for server software; Lennart did point out that there is one problem with its memory management: because of the way g_malloc() works, it shouldn’t be used for lower-level system software like udev and similar, since you cannot make sure you stay inside your resource limits.
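For context, the difference is easy to see in a couple of lines: g_malloc() aborts the whole process when an allocation fails, while g_try_malloc() behaves like plain malloc() and gives the caller a chance to recover. This tiny sketch is just an illustration, not feng code:

```c
#include <glib.h>

int main(void)
{
    /* g_malloc() never returns NULL: if the allocation fails
     * (for instance because a resource limit was hit) it aborts
     * the whole process, which is a problem for software that
     * has to survive under tight rlimits. */
    gpointer a = g_malloc(64);

    /* g_try_malloc() behaves like plain malloc(): it returns
     * NULL on failure and lets the caller degrade gracefully. */
    gpointer b = g_try_malloc(64);
    if (b == NULL)
        g_warning("allocation failed, recovering");

    g_free(a);
    g_free(b);
    return 0;
}
```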

Today I noticed one more problem that I’ve got to work on to make feng a nice piece of software while using glib. This problem is also related to memory management, but this time it’s due to the slice allocator, a slab-like allocator for userspace that is used both for all the newer glib structures and for most of feng’s structures when we’re dealing with fixed-size objects (feng also has lots of non-fixed-size objects, but those cannot easily be allocated with slices).
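For reference, this is roughly what slice-based allocation of a fixed-size object looks like; the structure here is made up for illustration, not one of feng’s actual types:

```c
#include <glib.h>

/* Hypothetical fixed-size object; any struct whose size is
 * known at compile time can go through the slice allocator. */
typedef struct {
    gpointer data;
    gsize    size;
} ExampleEntry;

static ExampleEntry *example_entry_new(gpointer data, gsize size)
{
    /* g_slice_new() carves the object out of a slab-like chunk
     * instead of hitting malloc() for each small allocation. */
    ExampleEntry *entry = g_slice_new(ExampleEntry);
    entry->data = data;
    entry->size = size;
    return entry;
}

static void example_entry_free(ExampleEntry *entry)
{
    /* The type (and thus the size) has to be passed back, so
     * the memory returns to the right slab. */
    g_slice_free(ExampleEntry, entry);
}
```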

It actually worked quite well when implemented for most of the objects; where it turned out to be a problem was in my producer/consumer code (now fine-tuned and pretty much working fine). Curiously enough, to reproduce the problem I had to use one particular file: the 1080p AVI version of Big Buck Bunny. With the same 1080p version in H.264 and QuickTime format, the problem does not happen at all.

So what is the problem? When playing back that file one or more times, with one or multiple clients, the feng process starts to allocate more and more memory; Luca stopped it when it reached 1 GiB of resident memory, I stopped at 2 GiB. The reason this happens is that my producer/consumer code (which is where the problem appears) puts no limit on the number of packets that are read (it does not read the whole file into memory at once, and it waits to be asked before reading more data, but once requests start to arrive, that is no longer enough), so it creates a very high number of objects in a very short time; since the slice allocator runs out of room quite quickly, it starts allocating new memory chunks to store the objects in. As an added note, the objects I’m talking about in this article are just the internal representation of the actual objects produced and consumed, of the same kind as glib’s own GList nodes. They are not the actual data that gets read (which is variable-sized); they are just small objects that keep a pointer to the actual data.
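A toy reproduction of the pattern, with GQueue standing in for my queue (its list nodes are slice-allocated internally): the producer side pushes nodes far faster than the consumer side drains them, so the slice allocator keeps growing its chunks to keep up.

```c
#include <glib.h>

int main(void)
{
    GQueue *queue = g_queue_new();

    /* Producer side: every demuxer read pushes more small
     * nodes; GQueue's links come from the slice allocator. */
    for (guint i = 0; i < 1000000; i++)
        g_queue_push_tail(queue, GUINT_TO_POINTER(i + 1));

    /* Consumer side: draining happens at RTP pacing, far more
     * slowly; by the time it catches up, the slice allocator
     * has already grown its chunks, and they are not returned
     * to the system while the thread lives on. */
    while (!g_queue_is_empty(queue))
        g_queue_pop_head(queue);

    g_queue_free(queue);
    return 0;
}
```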

Now, the problem is that these chunks are, as far as I can see, never freed until, at the earliest, the thread that allocated them is killed; this would be okay if we were handling everything in separate threads, but there are a number of variables in play here. Allocation and deallocation happen on different threads: the former on the producer object, which is executed by the resource-reading thread, and the latter usually on the consumer, unless you abort the playback (which runs either on the main thread now, or on the worker thread once we switch). Resource reading happens either on a separate thread (called “mediathread”), or, in the newer code, on one of a series of threads in a pool where the read event is executed asynchronously; neither of these threads is deleted by the time the connection is closed. There is also a concern about creating multiple threads per client, since the server is supposed to handle hundreds of clients just fine.
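To make the thread situation concrete, here is a minimal sketch of the allocation pattern, using current GLib threading primitives and made-up names; every node is slice-allocated in one thread and slice-freed in another, and since neither thread ever exits, the chunks backing those slices don’t get handed back:

```c
#include <glib.h>

typedef struct {
    gpointer payload;   /* would point at the demuxed data */
} Node;

static Node stop_node;          /* sentinel to end the consumer */
static GAsyncQueue *channel;

/* Stand-in for the resource-reading thread: allocates nodes. */
static gpointer producer(gpointer data)
{
    for (guint i = 0; i < 100000; i++) {
        Node *node = g_slice_new(Node);
        node->payload = NULL;
        g_async_queue_push(channel, node);
    }
    g_async_queue_push(channel, &stop_node);
    return NULL;
}

/* Stand-in for the consuming thread: frees every node. The
 * freed slices land in this thread's free lists, while the
 * producer keeps demanding fresh ones, so the backing chunks
 * only grow as long as both threads stay alive. */
static gpointer consumer(gpointer data)
{
    Node *node;
    while ((node = g_async_queue_pop(channel)) != &stop_node)
        g_slice_free(Node, node);
    return NULL;
}

int main(void)
{
    channel = g_async_queue_new();
    GThread *p = g_thread_new("producer", producer, NULL);
    GThread *c = g_thread_new("consumer", consumer, NULL);
    g_thread_join(p);
    g_thread_join(c);
    g_async_queue_unref(channel);
    return 0;
}
```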

Just to make sure the problem lies with the slice allocator code, I ran feng under valgrind with the slice allocator disabled by setting G_SLICE=always-malloc. This was actually quite interesting, since it showed me that there are other problems in feng that were hidden earlier by the slice code. It also showed that there is really no leak, at least in my producer/consumer code. So the problem is not that objects don’t get deallocated, but rather that they are deallocated much more slowly than they are allocated, which, for slice-allocated objects, means that the chunks keep being expanded. Unfortunately, the idea of just making the producer/consumer code use normal heap-allocated objects is not appealing: at the rate they are allocated, the malloc/free overhead would be quite bad.

I’m now considering a few possible solutions to this problem, but it seriously looks like my best choice is to limit the rate of buffer allocations. Before I do that, though, I want to flesh out, for myself and for whoever is reading or is going to hit this problem, what is causing the issue and why on earth it only happens with the AVI file and not with the QuickTime file. The reason is actually a mix of different factors, both in the file format itself and in the way feng works; I can do my best to reduce the issues related to the latter, but I can’t do much about the former.

Let’s start with the file; the AVI name stands for Audio Video Interleave, which is what the file actually is: the video and audio data are interleaved in chunks, and by design the format does not really try to balance the audio and video parts for streaming so much as for decoding on old computers with very little horsepower. For this reason, to get enough data to stream for all the tracks, the demuxer has to read many more chunks than it likely would for more streaming-oriented formats like QuickTime.

This ties in with what feng is doing: every time an RTP packet is sent to the client, a request is triggered for the demuxer to read a packet, for both the audio and video tracks; unfortunately, this usually causes many more chunks of data to be read together, since a single read by the demuxer produces more data than can be sent in a single RTP packet. It’s almost literally eating more than it can swallow.
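The arithmetic is easy to sketch with made-up numbers: a single demuxed chunk of 1080p video spans many RTP packets, so each send-one-packet request can queue up dozens of packets’ worth of buffers at once.

```c
#include <glib.h>

/* Typical RTP payload budget once IP/UDP/RTP headers are taken
 * out of a 1500-byte Ethernet MTU; the exact figure varies. */
#define RTP_PAYLOAD_SIZE 1400

/* How many RTP packets one demuxer read turns into. */
static guint packets_for_chunk(gsize chunk_size)
{
    return (chunk_size + RTP_PAYLOAD_SIZE - 1) / RTP_PAYLOAD_SIZE;
}

int main(void)
{
    /* A hypothetical 180 KiB 1080p keyframe chunk: one read of
     * the demuxer, but over a hundred packets' worth of buffers
     * queued, while only one packet goes out per request. */
    g_print("%u packets\n", packets_for_chunk(180 * 1024));
    return 0;
}
```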

I’m already planning out in my mind a few possible ways to avoid this situation, but I also have to consider that this wasn’t happening before with the old bufferpool library; the reason is, quite simply, that it didn’t have a seemingly infinite memory area to play with, unlike the slice allocator. The concept of bufferpool’s slots meant that resources were limited by force, and when more data was produced than could be consumed, it either stopped, waiting for a slot to free up, or started dropping buffers. While that worked, I still don’t like those choices.

The approach I’m preferring right now, among those I’ve thought about, is to add a maximum length to the queue, at least for unseen packets. Simply capping the total number of elements in the queue could work, as could a circular array like bufferpool used to have, but I’m not too keen on that idea, because it also means that a single slow client in a multicast stream could stop all the others, or at least slow them down as well; limiting the tide of the queue, that is, the buffers that nobody has seen yet, seems to me to have better potential.
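The shape of the idea, with hypothetical names and thresholds: cap only the “tide” of unseen buffers, so that buffers kept around for slower clients of a multicast stream don’t count against the limit.

```c
#include <glib.h>

#define MAX_UNSEEN_BUFFERS 16   /* illustrative threshold */

typedef struct {
    GQueue *buffers;   /* all queued buffers, seen or not */
    guint   unseen;    /* buffers no consumer has read yet */
} BufferQueue;

/* Producer side: refuse to grow the tide of unseen buffers
 * past the cap; already-seen buffers retained for slower
 * multicast clients don't count against it. */
static gboolean buffer_queue_can_produce(const BufferQueue *bq)
{
    return bq->unseen < MAX_UNSEEN_BUFFERS;
}
```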

There is also the problem of read requests happening too soon: since we know there is no one-to-one relation between the number of RTP packets sent and the number of times the resource needs to be read, we cannot just ask the demuxer to read more data every time we send an RTP packet. What we have to do instead is keep a count of how many buffers haven’t been sent to the client yet, and when that count gets too low, request a new read.
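Continuing the BufferQueue sketch above, again with hypothetical names, this is the kind of check I have in mind, in the spirit of the original event_buffer_low() idea:

```c
#define BUFFER_LOW_WATERMARK 4   /* illustrative threshold */

static void request_demuxer_read(void);  /* hypothetical hook into the demuxer */

/* Called after a consumer takes a buffer from the queue. */
static void buffer_queue_after_consume(BufferQueue *bq)
{
    bq->unseen--;   /* this buffer has now been seen */

    /* Only when the number of still-unseen buffers drops below
     * the watermark do we ask for another read, instead of
     * requesting one for every RTP packet sent. */
    if (bq->unseen < BUFFER_LOW_WATERMARK)
        request_demuxer_read();
}
```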

Funnily enough, both of these problems seem to have already been considered by the original developers of feng, and you can see indications of that in the names that both bufferpool and feng already used for these ideas. Indeed, the original read request was made through an event_buffer_low() function; unfortunately, both its use and its implementation had diverged from the original idea by the time I started working on it.

Okay, I know this post has almost surely been uninteresting to all the readers who don’t care about lscube development; on the other hand, I needed to write this down to clear my mind anyway, and someone might actually find it interesting after all.
