This is a bit of an interesting corner case of a rant. I never wrote it down when I first came up with it, many years ago when I was actively working on multimedia software; until now I had only given it in person to a few people, because at the time it would have attracted too much unwanted attention from random people, the same kind of people who threatened me for removing XMMS from Gentoo so many years ago. I have, though, discussed this with at least one of the people working on PulseAudio at the time, and I have repeated it at the office a few times when the topic came up.
For context you may want to read this rant from almost ten years ago by Mike Melanson, who was at the time working for Adobe on Flash Player for Linux. It’s a bit unfortunate that the drawings from the post are missing (but maybe Mike has a copy?). You can find the missing drawing on the Internet Archive as well, but the whole gist is that the Linux audio APIs were already bloody confusing at the time, and that was before PulseAudio came along to stay. So where are we right now?
Well, the good news is that for the most part things got simpler: aRts and ESounD are now completely gone, eradicated in favour of PulseAudio, which is essentially the only consumer sound daemon in use today. Jack2 is still the standard for the pro-audio crowd, but even those people seem to have accepted that multimedia players are unlikely to care for it, and that it should be limited to pro-audio software. On the kernel driver side, the out-of-kernel drivers, which used to be fairly important, are effectively gone, in favour of development happening as a separate branch of the Linux kernel itself (Git was not a thing at the time, oh how things have changed!), and OSS is effectively dead. I don’t even know if it’s still available in the kernel, but the OSS4 fanboys have been quiet for long enough that I assume they gave up too.
ALSA itself hasn’t really changed much in all this time, either in the kernel or in userland. In the kernel, it got more complex to support things like jack sensing, as HDA started supporting soft-switching between speaker and headphone outputs. In userland, the plugin interface that was barely known before is now a requirement to properly use PulseAudio, both in Gentoo and in most other distributions. Which effectively makes my rant not only still relevant, but possibly more relevant. But before I go into details, I should take a step back and explain what the whole thing with userland and drivers is, with ALSA. I’ll try to simplify the history and the details, so if you know this very well you may notice that I skip over some things, but nobody really cares that much about those.
The ALSA project was born back when Linux was in version 2.4, and unlike today, that version was the version for a long time. Indeed, up until version 3.0, a “minor” version would stick around for years; the migration from 2.4 to 2.6 was a massive amount of work and took distributions, developers and users alike a lot of coordination. In Linux 2.4, the audio drivers were based on the OSS interface, which essentially meant you had /dev/dspX and /dev/mixerX, and you were done; most of the time mixer0 matched a number of dspX devices, and most devices would have both input and output capabilities, but that’s about all you knew. Access to the device was almost always exclusive to one process, except if the soundcard had multiple hardware mixing channels, in which case you could open the device multiple times. If you needed processes to share the device, your only option was to use a daemon such as the already named aRts or ESounD. The ALSA project aimed to replace the OSS interface (which by then had become a piece of proprietary software in its newer versions) with a new, improved interface in the following “minor” version (2.5, which stabilized as 2.6), as well as on the old one through additional kernel modules. The major drawback, from my point of view, is that this new interface is Linux-specific, while OSS was (and is) supported by most of the BSDs as well. But sometimes you have to do that anyway.
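To give an idea of how simple, and how limited, that model was, here is a rough sketch of OSS-style playback; the device path, format and rate are just typical values, nothing mandated by what I describe above:

```c
/* Rough sketch of OSS-style playback: open the device file, set the
 * format with a few ioctl()s, then write() raw PCM.  While this process
 * holds the device open, other processes usually cannot use it. */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/soundcard.h>
#include <unistd.h>

int main(void) {
    int fd = open("/dev/dsp", O_WRONLY);    /* almost always exclusive access */
    if (fd < 0)
        return 1;

    int fmt = AFMT_S16_LE, channels = 2, rate = 44100;
    ioctl(fd, SNDCTL_DSP_SETFMT, &fmt);     /* error handling omitted for brevity */
    ioctl(fd, SNDCTL_DSP_CHANNELS, &channels);
    ioctl(fd, SNDCTL_DSP_SPEED, &rate);

    static short silence[44100 * 2];        /* one second of stereo silence */
    write(fd, silence, sizeof(silence));    /* raw PCM goes straight to the card */

    close(fd);
    return 0;
}
```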
The ALSA approach provides a much more complex device API, but mostly for good reason: sound cards are (or were) complex devices, and are not consistent among themselves at all. To make things simpler for application developers, who previously only had to use open() and similar functions, ALSA provided a userland library, shipped in a package called alsa-lib, but more often known by its filename: libasound. While the interface of the library is not simple either, it does provide a bit of wrapping around the otherwise very low-level APIs. It also abstracts away some of the problems of figuring out which cards are present and which mixer refers to which device. The project also provided a number of tools and utilities to configure the devices, query for information or play back raw sound, and even a wrapper for applications implementing OSS access only, in the form of a preloadable library catching accesses to /dev/dsp and converting them to ALSA API calls, not unlike the similar utilities provided by aRts, ESounD or PulseAudio.
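For comparison, here is a minimal sketch of the same playback done through alsa-lib; I’m using the snd_pcm_set_params() convenience call to hide the full hardware/software parameters dance, and “default” is simply whatever device the configuration resolves that name to:

```c
/* Minimal sketch of playback through alsa-lib: instead of a device file,
 * you open a named PCM ("default" here) and let the library negotiate
 * format, rate and buffering with whatever is behind that name. */
#include <alsa/asoundlib.h>

int main(void) {
    snd_pcm_t *pcm;
    if (snd_pcm_open(&pcm, "default", SND_PCM_STREAM_PLAYBACK, 0) < 0)
        return 1;

    /* 16-bit little-endian stereo at 44.1 kHz, ~500 ms of latency. */
    if (snd_pcm_set_params(pcm, SND_PCM_FORMAT_S16_LE,
                           SND_PCM_ACCESS_RW_INTERLEAVED,
                           2, 44100, 1 /* allow resampling */, 500000) < 0)
        return 1;

    static short silence[44100 * 2];        /* one second of stereo silence */
    snd_pcm_writei(pcm, silence, 44100);    /* counted in frames, not bytes */

    snd_pcm_drain(pcm);
    snd_pcm_close(pcm);
    return 0;
}
```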
In the original ALSA model, access to the device was still limited to one process per channel, but as soundcards with more than one hardware channel quickly became obsolete (particularly as soundcards kind-of standardized on AC’97, and then HDA), the need for sharing access arose again. Since both aRts and ESounD had their limits (and PulseAudio was far from ready), the dmix interface arrived: in this setup, the first process opening the device would get actual access to the hardware, and would also set up a shared memory area where other processes could provide their audio, which would then be mixed together in userland, within the process space of that first process. This had all sorts of problems, particularly when sharing across users, or when sharing with processes that only used sound for a limited amount of time.
What dmix actually used was the ability of ALSA to provide “virtual” devices, which can be configured for alsa-lib to see. Another feature that got more spotlight thanks to the shrinking feature set of soundcards, particularly with the HDA standard, is the ability to provide plugins extending the functionality of alsa-lib; for a while the most important one was clearly the libsamplerate-based resampling plugin, which almost ten years ago was the only way to get non-crackling sound out of an HDA soundcard. These plugins provided other features too, such as a virtual device encoding to Dolby AC3 so that you could use S/PDIF pass-through to a surround decoder. Nowadays, the really important plugin is the PulseAudio one, which allows any ALSA-compatible application to talk to PulseAudio, by configuring a default virtual device.
Okay, now that the history lesson is complete, let me try to write down what I think is a problem with our current, modern setup. I’ll exclude pro-audio workstations from the discussion, as these have clearly different requirements from “mainstream” users and would most likely still argue (from a different angle) that the current setup is overengineered. I’ll also exclude most embedded devices, including Android, since I don’t think PA ever won over the phone manufacturers outside of Nokia, although I would expect that a number of them actually do rely on PulseAudio to some degree, in which case the discussion would still apply.
On a current Linux desktop, your multimedia applications end up falling into two main categories: those that implement PulseAudio support and those that implement ALSA support. They may use a wrapper library such as SDL, but at the end of the day these are the two APIs that allow you to output sound on modern Linux. The few rare cases of (probably proprietary) apps implementing only OSS can be ignored, as they would then use aoss or padsp to preload the right library and work with whichever stack you prefer. Whichever distribution you’re using, both classes of apps are extremely likely to end up going out of your speakers through PulseAudio. If the app only supports ALSA, the distribution is likely providing a configuration file so that the default ALSA device is a virtual device pointing at the PulseAudio plugin.
When the app talks to PulseAudio directly, it uses the PulseAudio API through the client library, which then speaks its custom IPC protocol to the PulseAudio daemon; the daemon in turn uses alsa-lib through its API, ignoring all the configured virtual devices, and alsa-lib talks to the kernel drivers through their device files. It’s a bit different for Bluetooth devices, but you get the gist. At first sight this should sound just fine.
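For that first path, “talks to PulseAudio directly” typically means something like the following; this is a rough sketch using the simple client API (most real players use the full asynchronous API instead), and the application and stream names are made up:

```c
/* Rough sketch of the PulseAudio "simple" client API: the library
 * connects to the daemon over PulseAudio's own protocol, and the daemon
 * does the actual talking to alsa-lib and the kernel. */
#include <pulse/simple.h>

int main(void) {
    pa_sample_spec ss = {
        .format   = PA_SAMPLE_S16LE,
        .rate     = 44100,
        .channels = 2,
    };

    int error;
    pa_simple *s = pa_simple_new(NULL,            /* default server */
                                 "example-app",   /* application name, made up */
                                 PA_STREAM_PLAYBACK,
                                 NULL,            /* default sink */
                                 "playback",      /* stream description */
                                 &ss, NULL, NULL, &error);
    if (!s)
        return 1;

    static short silence[44100 * 2];              /* one second of stereo silence */
    pa_simple_write(s, silence, sizeof(silence), &error);  /* counted in bytes */
    pa_simple_drain(s, &error);

    pa_simple_free(s);
    return 0;
}
```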
If you look at an app that only supports the ALSA interface, it uses the alsa-lib API to talk to the default device, which uses the PulseAudio client library to IPC to the PulseAudio daemon, and then continues as above. In this case you have alsa-lib on both sides: the source application and the sink daemon. So what am I complaining about? Well, here is the thing: the parts of ALSA that the media application uses and the parts of ALSA that the PulseAudio daemon uses are almost entirely distinct: one only provides access to the configured virtual devices, the other only gives access to the raw hardware. The fact that they share an API barely matters, in my opinion.
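To make that distinction concrete, here is a hedged sketch of the two open calls sitting at either end of the pipeline; the device names are only examples, and the daemon’s actual device probing is considerably more involved than a single snd_pcm_open():

```c
/* The two "halves" of alsa-lib in the common desktop setup: the
 * application side goes through the configuration and plugin machinery,
 * the daemon side wants the raw hardware. */
#include <alsa/asoundlib.h>
#include <stdio.h>

int main(void) {
    snd_pcm_t *pcm;

    /* Application side: open the configured "default" device, which most
     * distributions point at the PulseAudio plugin. */
    if (snd_pcm_open(&pcm, "default", SND_PCM_STREAM_PLAYBACK, 0) == 0) {
        /* Typically a plugin type here, not SND_PCM_TYPE_HW. */
        printf("default -> PCM type %d\n", snd_pcm_type(pcm));
        snd_pcm_close(pcm);
    }

    /* Daemon side (roughly): open a raw hardware PCM by address, bypassing
     * configuration files, virtual devices and plugins entirely.  This will
     * fail with EBUSY if something else already holds the card. */
    if (snd_pcm_open(&pcm, "hw:0,0", SND_PCM_STREAM_PLAYBACK, 0) == 0) {
        printf("hw:0,0 -> PCM type %d\n", snd_pcm_type(pcm));
        snd_pcm_close(pcm);
    }

    return 0;
}
```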
From my point of view, a better solution would be for libasound to be provided by PulseAudio directly, implementing a subset of the ALSA API that either shows the devices as the sinks configured in PulseAudio or, if PA wants to keep the stream/sink abstraction to itself, exposes just a single device that is PulseAudio. No configuration files, no virtual devices, no plugins whatsoever, but if the application supports ALSA, it gets automatically promoted to PulseAudio. Then on the daemon side, PulseAudio can either fork alsa-lib, or have alsa-lib provide a simpler library that only gives access to the hardware devices and drops support for configuration files and plugins (after all, PulseAudio already has its own module system). Last I heard, there actually is an embedded version of libasound that implements only the minimal amount of features needed to access a sound device through ALSA. This should not only reduce the amount of “code at play” (pardon the pun), but also reduce the chance of misconfiguring ALSA into doing the wrong thing.
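Just to make the shape of the proposal clearer, here is a purely hypothetical sketch of what such a hardware-only library’s surface could look like; none of these names exist in ALSA or PulseAudio, they are only meant to show how little would be left once configuration files, virtual devices and plugins are gone:

```c
/* Entirely hypothetical declarations for a "hardware-only" sound library
 * as argued for above: no configuration files, no virtual devices, no
 * plugins, just enumeration of and access to real cards. */
#include <stddef.h>

typedef struct hwsnd_pcm hwsnd_pcm_t;            /* opaque handle, hypothetical */

int  hwsnd_card_count(void);                     /* how many real cards exist */
int  hwsnd_pcm_open(hwsnd_pcm_t **pcm,
                    int card, int device);       /* hardware addressing only, no names */
int  hwsnd_pcm_set_params(hwsnd_pcm_t *pcm,
                          int format, unsigned channels, unsigned rate);
long hwsnd_pcm_write(hwsnd_pcm_t *pcm,
                     const void *frames, size_t count);
void hwsnd_pcm_close(hwsnd_pcm_t *pcm);
```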
Misconfiguring ALSA is probably the most common reason for your sound not working the way you expect on Linux: the configuration files, options and defaults kept changing, and things are so different from ten years ago that you’re likely to find very bad, old advice out there, and it’s not always obvious that you shouldn’t follow it. For instance, for the longest time Adobe Flash, thinking it was doing the right thing, would not abide by the default ALSA configuration and would instead try to access the hardware device directly (mostly because of nasty bugs with dmix), which meant that PulseAudio couldn’t access the device anymore. The architecture I sketched above would solve that problem, as the application would not be able to tell the difference between the hardware device and the PulseAudio virtual device: the former would simply not be there!
And just to close up my ALSA rant, I would like to remind you all that alsa-lib still comes with its own LISP interpreter: the ALISP dialect was meant to provide even more configurability to the sound access interface, and most distributions, as far as I know, still have it enabled. Gentoo provides a (default-off) alisp USE flag, so you’re at least spared that part in most cases.
Update 2020-05-08: Adobe appears to have “archived” (or rather deleted) Mike’s blog from their site, some time in 2017. I’ve updated the link above to point at the Internet Archive (to which I recommend donating, if you can). While doing that I even managed to find a copy of the drawing that was missing when I originally wrote this post. Score!