Tim asked me for some help to identify bundled libraries, so I come back to a topic that I haven’t written about in quite a while, mostly because it seemed like a lost fight with too many upstream projects.
First of all let’s focus on what I’m talking about. With the term “bundled libraries” I’m referring to libraries whose sources are copied verbatim, or only slightly modified, within another project. This is usually done by projects with either a number of dependencies, or uncommon dependencies, under the impression it makes it easier for the users to build the project. I maintain that such notion is mostly obsolete as it’s not the place of most random users to actually build their own projects, leaving that job to distributions, which have the inverse problem with bundled libraries: they make packaging harder, and software much more vulnerable to security issues.
So how do you track down bundled libraries issues? Well one option is to check out each packages’ sources, but that’s not really applicable that easily, especially since the sources might be masked — I found more than one project taking a library originally in a number of source files, concatenating all of them together, and then building it in a single object file. The other options relate to analyse the result of the build, executables and libraries. This is what I’ve done with my scripts when I started looking at a different, but slightly related, problem (symbol collisions).
You can analyse the resulting binaries in many ways; the one I’ve been doing with the scripts I’m referring to above is a very shallow detection: it wasn’t designed for the task of looking for bundled libraries, it was designed to identify libraries and executables exporting the same symbols, that would cause trouble if they were ever loaded in the same address space. For those curious, this is because the dynamic loader needs to choose only one symbol for any name, so either the libraries or the executable would end up calling the wrong function or accessing the wrong data. While this can sound too theoretical, one such bug I reported back in 2008 was the root cause of a runtime crash of Evince in 2011.
At any rate this method is not optimal for the task of finding bundled libraries for two reasons: the first is that it only checks for identical symbol names, so it doesn’t take into account libraries that are bundled and have their method renamed – and yes I have seen that done – the second is that it only works with exported symbols, that are those that the dynamic loader can collide on. What is not included here are the so-called local symbols: static symbols, symbols declared with private ELF visibility, and those that are hidden by linker scripts at link time.
To inspect the binaries deeper, you need non-stripped copies; it doesn’t really require you to have them built with DWARF data in them (the debug data that is added when building with, for instance, -ggdb); it can work even just the complete symbol table kept in the .symtab section of final binaries (and usually stripped away). To get the list of all symbols present within a binary, be them data (constants and variables) or code (functions), you can simply use the nm --defined-only command (if you add --dynamic you end up with the same data I’m analysing with my script above, as it changes the table that it uses to look up symbols).
Unfortunately this still requires to find a way to match the symbols even with prefixed versions, and with little false positives; this is why I haven’t worked on a script to deal with this kind of data yet. Although while writing this I can think of a way to make it at least scan for a particular library against a list of executables, that is not really one of the best-performing options available. I guess the proper answer here would be to find a way to generate a regular expression for each library based on the list of symbols they include, and then grep for that over the symbols exported by the other binaries.
Note here: you can’t expect that all the symbols of a bundled library are present in the final binary; when linking against object files or static archives the linker applies logic akin to the --as-needed one and will not bring in object files if no symbol from that file is referenced by the code; if you use other techniques then you can even skip single functions and data symbols that are unused by your code. Bottom line is that even if the full sources of another library are bundled and built, the final executable might not have all of them on it.
If you don’t have access to a non-stripped binary, well then your only option is to run strings over the package and try to find matching strings for the library you’re looking for; if you’re lucky, the library provides version information you can still track down in raw strings. This unfortunately is the roughest option you have available and I’d not suggest doing it for anything you have sources for; it’s a last-resort for proprietary, pre-built software.
Finally, let me say a couple of words regarding identify the version of an exported version of a library, which I have found to be done way too many times. If this happens on an ET_DYN file, such as a shared object, you can easily dlopen() it and then call functions like zlibVersion() to get the actual version of the library embedded. This ends up being pretty important when trying to tell whether a package is bundling a vulnerable version of zlib or libpng, even though it is also not extremely reliable as it is.