ABI changes keeping backward compatibility

So I’ve written about what an ABI is and what breaks it, and about how to properly version libraries, and I said I would follow up with some further details on how to improve compatibility between library versions. The first trick is using symbol versioning to maintain backward compatibility.

The original Unix symbol namespace (which is still the default one) is flat; that means that a symbol (function, variable or constant) is only ever identified by its name. To make things more versatile, Sun Solaris and the GNU C library both implemented “symbol versioning”: a method to attach one extra “namespace” identifier to the same symbol name. The feature is used both to avoid symbol collisions and to maintain ABI compatibility.

In this post I’m focusing on using the feature to maintain ABI compatibility, by changing the ABI of a single function, simulating two versions of the same library, and yet allowing software built against the old version to work just fine with the new one. While it’s possible to achieve the same thing on many different systems, for ease of explanation I’m going to focus on glibc-based Linux systems, in particular on my usual Yamato Gentoo box.

While it does not sound like a real-life situation, I’m going to use a very clumsy API-compatible function whose ABI can be broken, and which simply reports the ABI version used:

/* lib1.c */
#include <stdio.h>

void my_symbol(const char *str) {
  printf("Using ABI v.1n");
}

The ABI break will come in the form of the const specifier being dropped; since this removes a restriction on what the function can do with the parameter, it is a “silent” ABI break (in the sense that it wouldn’t warn the user if we weren’t taking precautions). Indeed, if the new version were written this way:

/* lib2.c */
#include <stdio.h>

void my_symbol(char *str) {
  printf("Using ABI v.2n");
}

the binaries would still execute “cleanly”, reporting the new ABI (of course, if we were modifying the string and the passed string was a static constant string, that would be a problem; but since I’m using x86-64 I cannot show that, since it still does not seem to produce a bus error or anything). Let’s assume that, instead of reporting the new ABI version, the software would crash because something happened that shouldn’t have.

The testing code we’re going to use is going to be pretty simple:

/* test.c */
#if defined(LIB1)
void my_symbol(const char *str);
#elif defined(LIB2)
void my_symbol(char *str);
#endif

int main() {
  my_symbol("ciao");
}

Now we have to make the thing resistant to ABI breakage, which in our case involves using inline ASM and GNU binutils extensions (because I’m trying to keep things simple for now!). Instead of a single my_symbol function, we’re going to define two public my_symbol_* functions, one for each ABI variant, and then alias them:

/* lib2.c */
#include <stdio.h>

void my_symbol_v1(const char *str) {
  printf("Using ABI v.1n");
}

void my_symbol_v2(char *str) {
  printf("Using ABI v.2n");
  str[0] = '';
}

__asm__(".symver my_symbol_v1,my_symbol@");
__asm__(".symver my_symbol_v2,my_symbol@@LIB2");

The two aliases are the ones doing the magic: the first binds the v1 variant to the base, unversioned symbol (nothing after the “at” character), while the second gives the v2 variant the LIB2 version and, thanks to the double “at” character, makes it the default definition used when linking new software. This does not work right away, though: first of all, the LIB2 version is defined nowhere; second, the v1 and v2 symbols are also exported, allowing other software direct access to the two variants. Since you cannot use the symbol visibility attributes here (you’d be hiding the aliases too), you have to provide the linker with a version script:

/* lib2.ldver */
LIB2 {
    local:
        my_symbol_*;
};

With this script you’re telling the linker that all the my_symbol_* variants are to be considered local to the object (hidden), and you are, at the same time, creating the LIB2 version to assign to the v2 variant.

But how well does it work? Let’s build two binaries against the two library versions, then execute each of them against both:

flame@yamato versioning % gcc -fPIC -shared lib1.c -Wl,-soname,libfoo.so.0 -o abi1/libfoo.so.0.1
flame@yamato versioning % gcc -fPIC -shared lib2.c -Wl,-soname,libfoo.so.0 -Wl,-version-script=lib2.ldver -o abi2/libfoo.so.0.2
flame@yamato versioning % gcc test.c -o test-lib1 abi1/libfoo.so.0.1                              
flame@yamato versioning % gcc test.c -o test-lib2 abi2/libfoo.so.0.2                              
flame@yamato versioning % LD_LIBRARY_PATH=abi1 ./test-lib1       
Using ABI v.1
flame@yamato versioning % LD_LIBRARY_PATH=abi1 ./test-lib2
./test-lib2: abi1/libfoo.so.0: no version information available (required by ./test-lib2)
Using ABI v.1
flame@yamato versioning % LD_LIBRARY_PATH=abi2 ./test-lib1
Using ABI v.1
flame@yamato versioning % LD_LIBRARY_PATH=abi2 ./test-lib2
Using ABI v.2

As you can see, the behaviour of the program linked against the original version does not change when moving to the new library, while the opposite is true for the program linked against the new library and executed against the old one (that would be forward compatibility of the library, which is rarely guaranteed). The “no version information available” warning is the loader pointing out exactly that: test-lib2 expects the LIB2 version information, which the old library does not carry, so the loader falls back to the unversioned definition.
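
A minimal sketch, not part of the original example, to double-check the versioned symbols at runtime: glibc’s dlvsym() extension lets you request a specific version of my_symbol, while plain dlsym() returns the default one (the definition marked with the double “at”):

/* check.c -- build with something like: gcc check.c -o check -ldl */
#define _GNU_SOURCE /* for dlvsym() */
#include <stdio.h>
#include <dlfcn.h>

int main(void) {
  void *handle = dlopen("abi2/libfoo.so.0.2", RTLD_NOW);
  if (handle == NULL) {
    fprintf(stderr, "%s\n", dlerror());
    return 1;
  }

  /* the default definition (my_symbol@@LIB2)... */
  void (*def)(char *) = (void (*)(char *))dlsym(handle, "my_symbol");
  /* ...and the LIB2 definition requested explicitly: the two should match */
  void (*v2)(char *) = (void (*)(char *))dlvsym(handle, "my_symbol", "LIB2");

  printf("default: %p, LIB2: %p\n", (void *)def, (void *)v2);

  dlclose(handle);
  return 0;
}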

This actually makes it possible to break the ABI (but not the API) and still be able to fix bugs. It shouldn’t be abused, since it does not always work that well and it does not work on all the systems out there, but it helps to deal with legacy software that needs to be kept backward-compatible and yet requires fixes.

For instance, if you didn’t follow the advice of always keeping alloc/free interfaces symmetrical, and you wanted to move a structure’s allocation from standard malloc() first to g_malloc() and then to GSlice, you can do that by using something like this:

/* in the header */

my_struct *my_struct_alloc();
#define my_struct_free(x) g_slice_free(my_struct, x)

/* in the library sources */

#include <stdlib.h>  /* malloc() */
#include <glib.h>    /* g_malloc(), g_slice_new() */

my_struct *my_struct_alloc_v0() {
  return malloc(sizeof(my_struct));
}

my_struct *my_struct_alloc_v1() {
  return g_malloc(sizeof(my_struct));
}

my_struct *my_struct_alloc_v2() {
  return g_slice_new(my_struct);
}

__asm__(".symver my_struct_alloc_v0,my_struct_alloc@");
__asm__(".symver my_struct_alloc_v1,my_struct_alloc@VER1");
__asm__(".symver my_struct_alloc_v2,my_struct_alloc@@VER2");

Of course this is just a proof of concept: if all you did was allocate the structure, two macros would be fine, so assume the allocation functions also do something with the newly allocated structure. If versioning weren’t used, software built against the first version, whose baked-in deallocator matches malloc(), would crash once the library started handing out GSlice-allocated memory; with this method, instead, each program keeps receiving memory from the allocator that matches the deallocator it was built with, even in future releases.
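
One detail the proof of concept glosses over: just like LIB2 in the earlier example, the VER1 and VER2 versions have to be defined in the version script passed to the linker. Also, if the deallocation side were an exported function rather than a macro baked into the callers, it would need the exact same treatment; here is a minimal, purely hypothetical sketch of what that could look like (the my_struct definition is just a placeholder):

/* hypothetical continuation of the library sources: only needed if
   my_struct_free() were a real exported function rather than a macro */
#include <stdlib.h>
#include <glib.h>

typedef struct { int dummy; } my_struct;  /* placeholder definition */

void my_struct_free_v0(my_struct *s) { free(s); }
void my_struct_free_v1(my_struct *s) { g_free(s); }
void my_struct_free_v2(my_struct *s) { g_slice_free(my_struct, s); }

__asm__(".symver my_struct_free_v0,my_struct_free@");
__asm__(".symver my_struct_free_v1,my_struct_free@VER1");
__asm__(".symver my_struct_free_v2,my_struct_free@@VER2");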

Shared Object Version

Picking up where I left off with my post about ABI, I’d like to provide some insight into the “soversion”, or shared object version, that is part of the “soname” (the canonical name of a shared object/library), and its relationship with the -version-info option of libtool.

First of all, the “soname” is the string listed in the DT_SONAME tag of the .dynamic section of an ELF shared object. It represents the canonical name the library should be called by: it’s used to create the DT_NEEDED entries of the shared objects and dynamic executables depending on it, and it’s also the canonical name used when opening the library through dlopen() (without the full path).
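
As a small illustration (not from the original post, and reusing the hypothetical libfoo from the previous section): a dlopen() caller refers to the library by its soname, and the loader finds the real file through the symlink that ldconfig creates from that soname.

/* sketch.c -- build with something like: gcc sketch.c -o sketch -ldl */
#include <stdio.h>
#include <dlfcn.h>

int main(void) {
  /* the loader resolves this through the libfoo.so.0 symlink, which
     ldconfig creates to point at the real file (e.g. libfoo.so.0.2),
     based on the DT_SONAME found inside it */
  void *handle = dlopen("libfoo.so.0", RTLD_NOW);
  if (handle == NULL) {
    fprintf(stderr, "%s\n", dlerror());
    return 1;
  }

  dlclose(handle);
  return 0;
}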

Usually, the soname is composed of the library’s basename (libfoo.so) followed by a reduced shared object version, but the extent to which it is reduced (or not) depends on the standard rules of the operating system and a few other details. What I’m going to talk about today is that last part, the shared object version, which is probably the most important part of the soname.

First of all, the “soversion” does not correspond to either the package version or the -version-info parameter (although it is calculated starting from the latter); using either directly would be a big mistake, unless you expect to be able to keep a perfect ABI tied to your package versioning, in which case you might want to try using the package version, but that’s quite a difficult thing to pull off.

The part of this version that is embedded in the soname is the version of the ABI, and it has to change when the ABI changes, following the rules I showed previously. If it were kept the same between versions while the ABI was broken, software would subtly crash because of the changes in the ABI. By changing the ABI version, and thus the soname, you make the loader refuse to start the program with a different library than the one it was developed for; of course that does not make the software magically work, but it at least stops it from crashing further down the road.

By default on Linux and Solaris, at least with libtool, a single component is used in the soname as the ABI version. Projects following this rule manually, and setting their soversion to their package’s major version, would be promising a single ABI for each major version of their software; I have rarely seen anything like that work out well. Ruby uses a mix of this, defining two components as the soversion, so that eventually you could have libruby.so.1.8 and libruby.so.1.9 (on the other hand, we rename them to libruby18 and libruby19 so that they don’t collide, for other reasons, but that’s beside the point). This works as long as they never have to change, for any reason, the ABI of a minor release of Ruby; when that happens, something will certainly break.

The -version-info option of libtool is explicitly distinct from both the package version and the actual soversion, and is used to provide consistent library versioning across releases through three components: current, revision and age; together they describe which interfaces are supported and which have been dropped. Understanding the details is quite a waste of time, but the rules can be summarised in three simple steps:

  • if you don’t change the interface at all, just increase the “interface revision” value;
  • if you make backward-compatible changes (like adding interfaces), increase the “current interface” value and the “older interface age” value, and reset the “interface revision” to zero;
  • if you make backward-incompatible changes, breaking the ABI (removing interfaces, for instance), increase the “current interface” value and reset both the “older interface age” and the “interface revision” to zero.

Depending on the operating system, this will create a soname change either only on backward-incompatible changes (Linux, Solaris and Gentoo/FreeBSD), or with any type of change to the interface (vanilla FreeBSD). On Linux, for instance, libtool turns -version-info current:revision:age into a file named libfoo.so.(current - age).(age).(revision), with libfoo.so.(current - age) as the soname, so only the backward-incompatible step above actually changes the soname.

Again, the idea is that each time you might have a backward-incompatible change you get a different soname, so that the loader can’t mix and match different interfaces. When you don’t guarantee any ABI stability between versions, usually for internal libraries, like GNU binutils does for libbfd, you just put the package version in the library’s basename rather than in the soversion, and set the soversion to all zeros, so you get stuff like libbfd-2.20.so.0.0.0. This way you’re sure that, whether you change interfaces or not, an upgrade of your package won’t silently break other people’s software. It should also be enough of a hint that the library should not be relied upon at all, since it’s not guaranteed to be stable.

The next step is going to be describing the symbol versioning technique, which reduces the number of backward-incompatible changes by keeping the same ABI available until it really has to go.

Application Binary Interface

Following my earlier post about libtool archives, Jorge asked me how they relate to the numbers that come at the end of shared objects; mostly, they don’t, with the exception that the libtool archive keeps a list of all the symlinks with different suffix components for a given library.

I was, actually, already planning on writing something about shared object versioning since I did note about the problems with C++ libraries related to the “Ardour Problem”. Unfortunately it requires first some knowledge of what composes an ABI, and that requires me to write something before going deep in shared object versioning. And this hits on my main missing necessity right now: time. Since I have now more or less two jobs at hand, the time I can spare is dedicated more to Gentoo or my personal problems than writing in-depth articles for the blog. You can, anyway, always bribe me to write about a particular topic.

So let’s start with the basics of what the Application Binary Interface (ABI) is, in the smaller context that I’m going to care about, and how it relates to the shared object versioning topic I wanted to discuss. For simplicity, I’m not going to discuss issues like the various architecture-dependent calling conventions, and, for now, I’m also focusing on software written in C rather than C++; the ABI problems with the latter are an order of magnitude worse than the ones in C.

Basically, the ABI of a library is the interface between the library and the software using it; it is related to the API, but you can maintain API compatibility while breaking ABI compatibility, since the ABI involves many different factors:

  • the callable functions, with their names: adding a function is a backward-compatible ABI change (although it also means that you cannot execute something built against the newer library on a system with an older one), while removing or renaming a function breaks the ABI;
  • the parameters of the functions, with their types: while restricting the parameters of a function (for instance taking a constant string rather than a non-constant string) is ABI-compatible, removing those restrictions is an ABI breakage; changing the compound or primitive type of a parameter is also an ABI change, since you change its meaning; this is also why using types like off_t for parameters is a bad idea (their size depends on the feature-selection macros used);
  • the returned value of the functions: this means not only the type, but also the actual meaning of the value;
  • the size and content of the transparent structures (i.e. the structures that are defined, and not just declared, in the public header files; see the sketch after this list);
  • any major API change also produces an ABI change (unless symbol versioning is used to keep backward compatibility); it’s particularly important to note that changing how a dynamically-allocated returned value is allocated changes both API and ABI if there is no symmetrical function to free it; this is why, even for very simple data types, you should have a symmetrical alloc/free interface.
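
To make the transparent-structure point more concrete, here is a small illustrative sketch; the names are made up, and the two layouts are shown side by side even though in reality they would be successive revisions of the same structure in the same public header:

/* the layout the application was built against (old public header) */
struct foo_config_v0 {
  int width;
  int height;
};

/* the layout the new library believes in: a field was appended;
   in the real header this would still be called foo_config */
struct foo_config_v1 {
  int width;
  int height;
  int depth;  /* new field */
};

/* An application compiled against the old header allocates
   sizeof(struct foo_config_v0) bytes; when the new library writes the
   depth member it writes past the end of that allocation, corrupting
   whatever happens to sit next to it in memory. */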

Now there are a few important notes about what I just wrote, and to explain them I want to use FFmpeg as an example. It is often said that FFmpeg gives no guarantee of ABI compatibility within the same shared object version (I’ll return to that another time); this is false, because the FFmpeg developers do pay attention to keeping the public ABI compatible between releases as long as the released shared object has the same version. What they don’t guarantee is ABI compatibility for the internal symbols, and software like VLC, xine and GStreamer used to use the internal symbols without thinking about it twice.

This is why it’s important to use symbol visibility to hide the internal symbols: once you have hidden them you can do whatever you want with them, since no software can rely on them, nor suffer subtle crashes or misbehaviour because of a change in them. But that’s a topic for another day.
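
A minimal sketch of what that looks like with GCC (the function names are made up for the example): internal helpers get hidden visibility, so they never become part of the exported ABI in the first place, while the public entry points keep default visibility.

/* part of the public ABI: exported, can only change following the rules above */
__attribute__((visibility("default")))
int foo_decode(const char *buffer, int length);

/* internal helper: hidden, so no external software can bind to it and it
   can change freely between releases */
__attribute__((visibility("hidden")))
int foo_decode_internal(const char *buffer, int length, int flags);

/* the common approach is the other way around: build with -fvisibility=hidden
   and explicitly mark only the public entry points with default visibility */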

Ruby-Elf and Sun extensions

I’ve written in my post about OpenSolaris that I’m interested in extending Ruby-Elf to parse and access Sun-specific extensions, that is, the .SUNW_* sections of ELF files produced under OpenSolaris. Up to now I only knew the format, and not even that properly, of the .SUNW_cap section, which contains the hardware and software capabilities of an object file or an executable, but I wasn’t sure how to interpret it.

Thanks to Roman, who sent me the link to the Sun Linker and Libraries Guide (I knew about it, but I lost the link quite a long time ago and then forgot it existed), I now know some more about the Sun-specific sections, and I’ve already started implementing support for them in Ruby-Elf. Unfortunately I’m still looking for a way to properly test them: in particular, I’m not yet sure how I can check for the various hardware-specific extensions, and I have no idea how to test the SPARC-specific data, since my Ultra5 runs FreeBSD, not Solaris. Right at the moment I write this, Ruby-Elf can properly parse the capabilities section with its flags, and report them back. Hopefully with no mistakes, since only basic support is in the regression tests for now.

One thing I really want to implement in Ruby-Elf is versioning support, with the same API I’m currently using for GNU-style symbol versioning. This way it’ll be possible for Ruby-Elf based tools to access both GNU and Sun versioning information as if they were a single thing. Too bad I haven’t looked up yet how to generate ELF files with Sun-style versioning support. Oh well, it’ll be one more thing I’ll have to learn, together with a way to set visibility with Sun Studio, to test the extended visibility support they have in their extended ELF format.

In general, I think that my decision of going with Ruby for this is very positive, mostly because it makes it much easier to support new stuff by just writing an extra class and hook it up, without needing “major surgery” every time. It’s easy and quick to implement new stuff and new functions, even if the tools will require more time and more power to access the data (but with the recent changes I did to properly support OS-specific sections, I think Ruby-Elf is now much faster than it was before, and uses much less memory, as only the sections actually used are loaded). Maybe one day once I can consider this good enough I’ll try to port it to some compiled language, using the Ruby version as a flow scheme, but I don’t think it’s worth the hassle.

Anyway, if you’re interested in improving Ruby-Elf and would like to see it go even further, so that it can report further optimisations and similar things (like, for instance, something I planned from the start: telling which of the shared objects listed in NEEDED entries are actually useless, without having to load the file through ld.so and use the LD_* variables), I can ask you one thing and one thing only: a copy of Linkers and Loaders that I can consult. I tried preparing a copy out of the original freely available HTML files for the Reader, but it was quite nasty to look at, nastier than O’Reilly’s freely-available eBooks (which are bad already). It’s in my wishlist if you want.

A few risks I see related to the new portage 2.2 preserve-libs behaviour

Fellow developer Zhang Le wrote about the new preserve-libs feature from Portage 2.2 that removes the need for revdep-rebuild.

As I wrote on the gentoo-dev mailing list when Marius asked for comments, there are a few problems with its implementation as it is, in my not-so-humble opinion. (Not-so-humble because I know exactly what I’m talking about, and I know it’s a problem.)

Let’s take a common scenario: on a system where --as-needed was never used, a common library is updated from ABI 0 to ABI 1 (so with a change of soname). This library might be, for instance, libexpat.

I don’t want to discuss here what an ABI is and what an ABI bump consists of. Let’s just say that when you make an ABI bump you either remove functions, or you change the meaning of some functions (like the parameters, the behaviour or other things like those).

In the first case the bump is annoying but not much of a problem: executables stop being loaded because symbols are undefined; with lazy binding, they might die in the middle of execution, at the moment the undefined symbol is first called, but that’s not our concern here.

The problem comes when a function with the same exact name changes meaning, parameters or return type. In this case, the executable might pass too much or too little data to the function, and the pointers might refer to something completely different, or might be truncated. In general, when you change the interface or the meaning of a function, executables built to use the previous version and executed against the new one will either crash or behave in a corrupted manner. These are two subtle issues we should watch out for, as they are hard to debug unless you know about them.

So let’s return to our library changing ABI. Let’s say we have libfooA.so.0 and libfooA.so.1 installed: the first is preserved by preserve-libs, the second is the new one. libfooB.so.$anything links to libfooA, as it uses it directly, so its package will be in the set of packages to rebuild.

Now introduce libfooC.so.$anything, which links to libfooB.so.$anything but, as fate would have it, also uses libfooA directly.

At this point, before the ABI bump, we have libfooB depending on libfooA.0, and libfooC depending on libfooB and libfooA.0; after the bump, we decide to rebuild only libfooB, which means that libfooB now depends on libfooA.1 while libfooC still depends on libfooA.0.

What this means is that, absent symbol versioning, the same symbol will have two (probably different) definitions loaded at once, which will collide with each other, leading to subtle crashes, misbehaviour and other fun-to-debug problems.

The problem is that the two ABIs of the library are both loaded into the same address space, which is a very bad thing unless the symbols are versioned. On the other hand, symbol versioning is a bit of a mess: it’s not implemented by all operating systems, and I find it quite convoluted.
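
To make the collision more tangible, here is a purely hypothetical sketch; the names are invented, and the two ABIs are shown side by side even though in reality each would live in its own copy of libfooA:

/* ABI 0 (libfooA.so.0): what libfooC was built against */
struct fooA_info_v0 { int a; int b; };
void fooA_get_info_v0(struct fooA_info_v0 *out);  /* really named fooA_get_info */

/* ABI 1 (libfooA.so.1): what the rebuilt libfooB was built against */
struct fooA_info_v1 { int a; int b; int c; };
void fooA_get_info_v1(struct fooA_info_v1 *out);  /* really named fooA_get_info */

/* With both copies of libfooA mapped into the same process and no symbol
   versioning, the dynamic loader binds every call to fooA_get_info to the
   first definition it finds in its search order; whichever library "loses"
   ends up calling a function with the wrong idea of the structure's layout,
   reading or writing the wrong number of bytes. */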

At the moment I don’t see anything in Portage that stops you from shooting yourself in the foot by doing a partial rebuild. I hope I’m mistaken, but if I’m not, please remember to always do a full rebuild rather than a partial one; otherwise, instead of programs failing to start, you might end up with programs corrupting your data.