A smarter netcat?

Today I restored data from the hard disk of the previous laptop of a friend of mine, whose laptop burned out (almost literally) a few months ago. In these cases my usual first task is to back up the whole content of the drive, so that even if I screw up with the data there is no loss on the original drive. Having more than 2TB addressable on this workstation certainly helps.

Now, I’m really not sure whether the copy operation was network-bound or disk-bound, and if it was the disk, whether it was the reading or the writing, but it certainly wasn’t blazing fast: around an hour to copy 120GB. Sure, that averages to 2GB/min, so it’s not really slow either, but I still see room for improvement.

The host reading the data (Enterprise living again) booted from SysRescueCD, so its CPU was doing nothing else; CPU power was certainly not the problem. Yamato was building, but it wasn’t CPU-bound either, that seemed obvious. The source was a 2.5” SATA drive, while the destination was an ext4 partition in LVM, on one of my SATA-II disks. The medium: a classic nc over a point-to-point Gigabit Ethernet connection (I borrowed the connection I use for the laptop’s iSCSI to back it up with Time Machine).

Now, I admit this is not something I do that often, but I do it from time to time nonetheless, so I think that spending a couple of hours thinking of a solution could help, especially since I can see similar situations happening in the future when I’m much tighter on time. The one thing that seemed obvious to me was the lack of a compression layer: since I wasn’t using a tunnel of any sort, netcat was sending the data through the TCP connection as it was read from the disk, without any kind of compression.

While data deduplication, as was suggested to me on a previous related post, would be cool, it might be overkill in this situation, especially considering that Partimage support for NTFS is still experimental (and as you might guess I tend to have this kind of need with non-Linux systems, mostly Windows and much more rarely Mac OS X). But at least being able to compress the vast areas of “zeros” on the disk would have been a major gain.

I guess that even just hooking up a zlib filter could have reduced the amount of traffic, but I wonder why there isn’t a simple option to handle the compression/decompression transparently rather than via a pipe filter. While filters are cool when you have to do advanced mojo on your inputs/outputs, tar has shown how it’s much easier to just have an option to handle that. I guess I’ll have to investigate alternatives to netcat, although I’d prefer an alternative that just lets me use a gzip filter to handle the stream against a standard non-smart nc command on the other side.
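A minimal sketch of the pipeline I have in mind, with made-up host and port details; the dd below just demonstrates how well a run of zeros compresses:

```shell
# A zero-filled megabyte squeezes down to a few kilobytes with gzip:
dd if=/dev/zero of=/tmp/zeros.img bs=1024 count=1024 2>/dev/null
gzip -1 < /tmp/zeros.img | wc -c   # a few KB instead of 1048576 bytes

# The actual transfer would then look something like this (hypothetical
# host 192.168.1.2 and arbitrary port 9000; note that the -l syntax
# varies between netcat implementations):
#   receiver: nc -l -p 9000 | gzip -d > laptop-disk.img
#   sender:   dd if=/dev/sda bs=1M | gzip -1 | nc 192.168.1.2 9000
```

gzip -1 trades ratio for speed, which matters when the compressor, rather than the network, risks becoming the bottleneck.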

I guess I’ll try my filter idea next time I use the network to transfer a hard disk image; for now, the next image I have to take care of is my mother’s iBook, whose IEEE1394 (FireWire) target mode is so nice to have.

Software sucks, or why I don’t trust proprietary closed source software

If you have been following my blog since I started writing, you might remember my post about imported libraries from last January and the follow-up related to OpenOffice; you might also know I started some major work toward identifying imported libraries using my collision detection script, and that I postponed it until I had enough horsepower to run the script again.

And this is another reason why I’m working on installing as many packages as possible in my testing chroot. The primary reason was, of course, to test for --as-needed support, but I’ve also been busy checking builds with glibc 2.8, GCC 4.3, and recently glibc 2.9. In addition to this, the build is also providing me with some data about imported libraries.

With this simple piece of script, I’m doing a very rough cut analysis of the software that gets installed, to check for the most commonly imported libraries: zlib, expat, bz2lib, libpng, jpeg, and FFmpeg:

rm -f "${T}"/flameeyes-scanelf-bundled.log
for symbol in adler32 BZ2_decompress jpeg_mem_init XML_Parse avcodec_init png_get_libpng_ver; do
    scanelf -qRs +$symbol "${D}" >> "${T}"/flameeyes-scanelf-bundled.log
done

if [[ -s "${T}"/flameeyes-scanelf-bundled.log ]]; then
    ewarn "Flameeyes QA Warning! Possibly bundled libraries"
    cat "${T}"/flameeyes-scanelf-bundled.log
fi

This checks for some symbols that are usually not present without the rest of the library, and although it gives a few false positives, it does produce interesting results. For instance, while I knew FFmpeg is very often imported, and I expected zlib to be copied into every other piece of software, it’s interesting to learn that expat is used as much as zlib, and that every time it’s imported rather than used from the system. This goes for both Free and Open Source Software and proprietary closed-source software. The difference is that while you can fix the F/OSS software, you cannot fix the proprietary software.

What is the problem with imported libraries? The basic one is that they waste space and memory by duplicating code already present on the system, but there is another issue too: they create situations where even old, known, and widely fixed issues remain around for months, even years, after they were disclosed. What has preserved proprietary software this well up to this point is mostly the so-called security through obscurity: you usually don’t know that the code is there, nor in which codepath it’s used, which makes it much harder for novices to figure out how to exploit those vulnerabilities. Unfortunately, this is far from being a true form of security.

Most people would now wonder: how can they mask the use of particular code? The first option is to build the library inside the software, which hides it from the eyes of the most naïve researchers; since the library is never loaded explicitly, its use cannot be spotted by watching which libraries get loaded. But of course the references to those libraries remain in the code, and indeed most of the time you’ll find the libraries’ symbols defined inside the executables and libraries of proprietary software. Which is exactly what my rough script checks for. I could use pfunct from the seven dwarves to get the data out of the DWARF debugging information, but proprietary software is obviously built without debug information, so that would just waste my time. If they used hidden visibility, finding the bundled libraries would be much, much harder.

Of course, finding which version of a library is bundled in an open source software package is trivial, since you just have to look at the headers for the one defining the version (although expat is often stripped of the expat.h header that contains that information). On proprietary software it is quite a bit more difficult.

For this reason I produced a set of three utilities that, given a shared object, find out the version of the bundled library. As it stands it quite obviously doesn’t work on final executables, but it’s a start at least. Running these tools on a series of proprietary software packages that bundle these libraries caused me some kind of hysteria: lots and lots of software still uses very old zlib versions, as well as old libpng versions. The current status is worrisome.

Now, can somebody really trust proprietary software at this point? The only way I can trust Free Software is by making sure I can fix it, but there are so many forks and copies and bundles and morphings that evaluating the security of the software is difficult even there; on proprietary software, where you cannot be really sure at all about the origin of the software, the embedded libraries, and stuff like that, there’s no way I can trust that.

I think I’ll try my best to improve the situation of Free Software even when it comes to security; as the IE bug demonstrated, Free Software solutions like Firefox can be considered working, secure alternatives even by the media. We should try to play that card much more often.

Using dwarves for binaries’ data mining

I’ve written a lot about my linking collisions script, which also shows the presence of internal copies of libraries in binaries. It might not be well understood that this is just a side effect: the primary scope of my script is not to find the included libraries, but rather to find possible collisions between two pieces of software with similar symbols and no link between them. This is what I found in Ghostscript bug #689698 and poppler bug #14451. Those are really bad things to happen, and that was my first reason for writing the script.

One reason why this script cannot be used with discovery of internal copies of libraries as its main function is that it will not find internal copies if they have hidden visibility, which is a prerequisite for properly importing an internal copy of whatever library (skipping over the fact that such an import is not a good idea in the first place).

To find internal copies of libraries, the correct thing to do is to build all packages with almost full debug information (so -ggdb), and use the DWARF data in them to find the definitions of functions. These definitions don’t disappear with hidden visibility, so they can be relied upon.
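A sketch of why the DWARF route works where symbol tables fail, using a toy object (the paths and the stub function are made up; pfunct from dev-util/dwarves would print the function name directly, while readelf from binutils shows the raw entries):

```shell
cat > /tmp/hidden.c <<'EOF'
/* a "bundled" function, properly hidden as an import should be */
__attribute__((visibility("hidden")))
unsigned long adler32(unsigned long a) { return a; }
unsigned long use_it(unsigned long a) { return adler32(a); }
EOF
gcc -ggdb -shared -fPIC -o /tmp/libhidden.so /tmp/hidden.c

# Hidden visibility keeps it out of the dynamic symbol table...
if nm -D --defined-only /tmp/libhidden.so | grep -q ' adler32$'; then
    echo "exported"
else
    echo "not in the dynamic symbol table"
fi
# ...but the DWARF data still names the definition; pfunct would list it:
#   pfunct /tmp/libhidden.so
readelf --debug-dump=info /tmp/libhidden.so | grep -w adler32
```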

Unfortunately, parsing DWARF data is a very complex matter, and I doubt I’ll ever add DWARF parsing support to ruby-elf, not unless I can find someone else to work on it with me. But there is already a toolset you can use for this: dwarves (dev-util/dwarves). I haven’t written a harvesting and analysis tool yet, and at the moment I’m just wasting a lot of CPU cycles scanning all the ELF files for single functions, but I’ll soon write something for that.

The pfunct tool in dwarves allows you to find a particular function in a binary file. I ran pfunct over all the ELF files in my system, looking for two functions so far: adler32 and png_free. The first is a common function from zlib; the latter is, obviously, a common function from libpng. Interestingly enough, I found two more packages that use an internal copy of zlib (one of which is included in an internal copy of libpng): rsync and doxygen.

It’s interesting to see how a base system package like rsync suffers from this problem. It means that it’s not just uncommon libraries being bundled by rarely used programs; widely known and accepted software also includes omnipresent libraries like zlib.

I’m now looking for internal copies of popt, which I’ve also seen imported more than a couple of times (cough distcc cough), and which is already a dependency of system packages. The problem is that DWARF parsing is slow, and it takes pfunct a long time to scan the whole system. That’s why I should use separate harvesting and analysis scripts.

Oh well, more analysis for the future :) And eliasp, when I’ve got this script done, then I’ll likely accept your offer ;)