On Rake Collections and Software Engineering

autumn, earth's back scratcher

Matthew posted on Twitter a metaphor about rakes and software engineering – well, software development, but at this point I would argue that anyone arguing over these distinctions has nothing better to do, for good or bad – and I ran with it a bit by pointing out that in my previous bubble, I should have used “Rake Collector” as my job title.

Let me give a bit more context on this one. My understanding of Matthew’s metaphor is that senior developers (or senior software engineers, or senior systems engineers, and so on) complain that their coworkers are making mistakes (“stepping onto rakes”, also sometimes phrased as “stepping into traps”), while at the same time making their environment harder to navigate (“spreading more rakes”, also “setting up traps”).

This is not a new concept. Ex-colleague Tanya Reilly expressed a very similar idea with her “Traps and Cookies” talk.

I’m not going to repeat all of the examples of traps that Tanya has in her talk, which I thoroughly recommend for people working with computers to watch — not only developers, system administrators, or engineers. Anyone working with a computer.

Probably not even just people working with computers — Adam Savage expresses yet another similar concept in his Every Tool’s a Hammer under Sweep Up Every Day:

[…] we bought a real tree for Christmas every year […]. My job was always to put the lights on. […] I’d open the box of decorations labeled LIGHTS from the previous year and be met with an impossible tangle of twisted, knotted cords and bulbs and plugs. […] You don’t want to take the hour it’ll require to separate everything, but you know it has to be done. […]

Then one year, […] I happened to have an empty mailing tube nearby and it gave me an idea. I grabbed the end of the lights at the top of the tree, held them to the tube, then I walked around the tree over and over, turning the tube and wrapping the lights around it like a yuletide barber’s pole, until the entire six-string light snake was coiled perfectly and ready to be put back in its appointed decorations box. Then, I forgot all about it.

A year later, with the arrival of another Christmas, I pulled out all the decorations as usual, and when I opened the box of lights, I was met with the greatest surprise a tired working parent could ever wish for around the holidays: ORGANIZATION. There was my mailing tube light solution from the previous year, wrapped up neat and ready to unspool.

Adam Savage, Every Tool’s a Hammer, page 279, Sweep up every day

This is pretty much the definition of Tanya’s cookie for the future. And I have a feeling that if Adam was made aware of Tanya’s Trap concept, he would probably point at a bunch of tools with similar concepts. Actually, I have a feeling I might have heard him saying something about throwing out a tool that had some property that was opposite of what everything else in the shop did, making it dangerous. I might be wrong so don’t quote me on that, I tried looking for a quote from him on that and failed to find anything. But it is something I definitely would do among my tools.

So what about the rake collection? Well, one of the things that I’m most proud of in my seven years at that bubble is the work I’ve done trying to reduce complexity. This took many different forms, but the main one has been removing multiple optional arguments from the interfaces of libraries that were used across the whole (language-filtered) codebase. Since I can’t give very close details of what that was about, you’ll find the example a bit contrived, but please bear with me.

When you write libraries that are used by many, many users, and you decide that you need a new feature (or that an old feature needs to be removed), you’re probably going to add a parameter to toggle the feature. Then you either expect the “modern” users to set it or, if you can, you do a sweep over the current users to have them explicitly request the current behaviour, and then you change the default.
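To make the pattern concrete, here is a minimal and deliberately contrived Python sketch. The parse_flag helper and its strict toggle are hypothetical names invented for this illustration, not taken from any real library. The first function adds the toggle with the old behaviour as the default; the second shows the state after the sweep, with the default flipped and a legacy caller pinned explicitly to the old behaviour.

```python
def parse_flag(text: str, strict: bool = False) -> bool:
    """Hypothetical library helper; `strict` is the newly added toggle."""
    value = text.strip().lower()
    if strict:
        # New behaviour: only accept the two canonical spellings.
        if value not in ("true", "false"):
            raise ValueError(f"not a boolean: {text!r}")
        return value == "true"
    # Legacy behaviour: anything not obviously false counts as true.
    return value not in ("", "0", "false", "no")


def parse_flag_after_sweep(text: str, strict: bool = True) -> bool:
    """The same helper once the default has been flipped."""
    return parse_flag(text, strict=strict)


# A legacy caller after the sweep: it now requests the old behaviour
# explicitly, and nothing ever forces it to drop that argument.
def legacy_tool_reads(text: str) -> bool:
    return parse_flag_after_sweep(text, strict=False)
```

In this sketch both code paths stay alive for as long as a single caller like legacy_tool_reads keeps passing strict=False.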

The problem with all of this is that cleaning up after these parameters is often seen as not worth it. You changed the default, so why would you care about the legacy users? Or you documented that all the new users should set the parameter to True; that should be enough, no?

That is a rake. And one that is left very much in the middle of the office floor by senior managers all the time. I have seen this particular pattern play out dozens, possibly hundreds of times, and not just at my previous job. The fact that the option is there to begin with is already increasing complexity on the library itself – and sometimes that complexity gets to be very expensive for the already over-stretched maintainers – but it’s also going to make life hard for the maintainers of the consumers of the library.

“Why does the documentation say this needs to be True? In this code my team uses, it’s set to False and it works fine.” “Oh this is an optional parameter, I guess I can ignore it, since it already has a default.” *Copy-pastes from a legacy tool that is using the old code-path and nobody wanted to fix.*

As a newcomer to an environment (not just a codebase), it’s easy to step on those rakes (sometimes uttering exactly the words above) and not know it until it’s too late, for instance when a parameter controls whether you use a more secure interface instead of an old one that is not expected to gain new users. When you become more acquainted with the environment, the rakes become easier and easier to spot, and my impression is that for many newcomers, that “rake detection” is the kind of magic that puts them in awe of the senior folks.

But rake collection means going a bit further. If you can detect the rake, you can pick it up, and keep it from smashing into the face of the next person who doesn’t have that detection ability. This will likely slow you down, but an environment full of rakes slows down all the newcomers, while a mostly rake-free environment would be much more pleasant to work in. Unfortunately, that’s not something that aligns with business requirements, or with the incentives provided by management.

A slight aside here. Also on Twitter, I have seen threads going by about the fact that game development tends to be a time-to-market challenge, which leaves all the hacks around because shipping on time is all you care about. I can assure you that the same is true for some non-game development too. Which is why “technical debt” feels like it’s rarely tackled (also on that note, Caskey Dickson has a good technical debt talk). This is the main reason why I’m talking about environments rather than codebases. My experience is with long-lived software, and libraries that existed for twice as long as I worked at my former employer, so my main environment was codebases, but that is far from the end of it.

So how do you balance the rake-collection with the velocity of needing to get work done? I don’t have a really good answer — my balancing results have been different team by team, and they often have been related to my personal sense of achievement outside of the balancing act itself. But I can at least give an idea of what I do about this.

I described this to my former colleagues as a rule of thumb of “three times”; to keep with the rake analogy, we can call it “three notches”. The first time I found something that annoyed me (inconsistent documentation, required parameters that made no sense, legacy options that should never be used, and so on), I would try to remember it, rather than going out of my way to fix it. The second time, I might flag it down somehow (e.g. by adding a more explicit deprecation notice, logging a warning if the legacy codepath is executed, etc.). And the third time I would just add it to my TODO list and start addressing the problem at the source, whether it would be within my remit or not.
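As an illustration of the second notch, here is a minimal Python sketch of flagging a legacy code path without breaking it. The parse_flag helper and its strict parameter are hypothetical names made up for this example; the only real API used is warnings.warn from the standard library.

```python
import warnings


def parse_flag(text: str, strict: bool = True) -> bool:
    """Hypothetical helper whose legacy code path is being flagged."""
    value = text.strip().lower()
    if not strict:
        # Second notch: keep the legacy path working, but make it noisy.
        warnings.warn(
            "parse_flag(strict=False) is a legacy code path; "
            "pass 'true' or 'false' and drop the argument",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not at this helper
        )
        return value not in ("", "0", "false", "no")
    if value not in ("true", "false"):
        raise ValueError(f"not a boolean: {text!r}")
    return value == "true"
```

Keep in mind that Python hides DeprecationWarning by default outside of code run directly in __main__, so in practice this kind of notice mostly surfaces in test runs or when warnings are explicitly enabled.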

This does not mean that it’s a universal solution. It worked for me, most of the time. Sometimes I got scolded for having spent too much time on something that had little to no bearing on my team, sometimes I got celebrated for unblocking people who had been fighting with legacy features for months if not years. I do think that it was always worth my time, though.

Unfortunately, rake-collection is rarely incentivised. The time spent cleaning up after the rakes left in the middle of the floor eats into one’s own project time, if it’s not the explicit goal of one’s role. And the fact that newcomers don’t step on those rakes and hurt themselves (or slow down, afraid of bumping into yet another rake) is rarely quantifiable in a way that would get managers to agree to it.

What could he tell them? That twenty thousand people got bloody furious? That you could hear the arteries clanging shut all across the city? And that then they went back and took it out on their secretaries or traffic wardens or whatever, and they took it out on other people? In all kinds of vindictive little ways which, and here was the good bit, they thought up themselves. For the rest of the day. The pass-along effects were incalculable. Thousands and thousands of souls all got a faint patina of tarnish, and you hardly had to lift a finger.

But you couldn’t tell that to demons like Hastur and Ligur. Fourteenth-century minds, the lot of them. Spending years picking away at one soul. Admittedly it was craftsmanship, but you had to think differently these days. Not big, but wide. With five billion people in the world you couldn’t pick the buggers off one by one any more; you had to spread your effort. They’d never have thought up Welsh-language television, for example. Or value-added tax. Or Manchester.

Good Omens page 18.

Honestly, I often felt like Crowley: I rarely ever worked on huge, top-to-bottom cathedral projects. But I would be sweeping up a bunch of rakes, so that newcomers wouldn’t hit them, and so that all of my colleagues would be able to build stuff more quickly.

Sigh…….

ata4: command 0xca timeout, stat 0xd0 host_stat 0x1
ata4: translated ATA stat/err 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00
ata4: status=0xd0 { Busy }
sd 3:0:0:0: SCSI error: return code = 0x8000002
sdc: Current: sense key=0xb
ASC=0x47 ASCQ=0x0
Info fld=0x4acec6
end_request: I/O error, dev sdc, sector 4902614
Buffer I/O error on device sdc5, logical block 4902488
lost page write due to I/O error on sdc5

And as you can guess, ata4’s sdc is one of the new disks, not the old one that was failing already. Okay, next week’s schedule: replace the disk. Until then, I can’t do much more.

And Seagate’s Seatools proved useless for the second time, as both drives passed the test with flying colours.

Sometimes bad experiences let you learn something

Like having to rebuild one’s /usr/lib64 tree helps to learn that there are quite a few duplicated files installed in a system.

The first thing I have to suggest to anybody who happens to have my problem is: make sure you remove the debug info files before starting the procedure. They are big and numerous, and if you, like me, still have partial directories present, it’s simple to find them and remove them altogether, which will save you a lot of md5sum calls. The script I’m using (actually it’s a one-liner, albeit with two while statements in it) is still a test run with echo rather than the commands themselves, but once I’m sure it works as intended, I’ll post it here, in case someone else might need it.

Then there are the tricky parts: the script being what it is, it will create a bit of a stir when a given file is present with the same md5sum in different places. This is easiest to see with empty files, or files containing just ‘\n’, which all have the same MD5SUM (.keep files are the most common offenders here). To avoid having to copy those files back all over (especially since mtime would be changed, and that is bad), I’ve added a simple -size +1 to skip over the tiniest files. Note that without a unit suffix find counts in 512-byte blocks, so this skips everything up to 512 bytes; -size +1c would be the form that skips only files of 1 byte or less. Hopefully that should take care of it.
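For illustration, the duplicate problem can be sketched in a few lines of Python. This is not my one-liner, just a minimal sketch of the same idea: the find_duplicates and md5_of names and the min_size threshold (in bytes) are made up for this example, and symlinks are skipped as an assumption about what should count as a file.

```python
import hashlib
import os
from collections import defaultdict


def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: str, min_size: int = 2) -> dict:
    """Map each MD5 digest to the files sharing it, ignoring tiny files."""
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and files of 1 byte or less (.keep and friends).
            if os.path.islink(path) or os.path.getsize(path) < min_size:
                continue
            by_digest[md5_of(path)].append(path)
    # Only digests seen more than once are interesting here.
    return {d: sorted(p) for d, p in by_digest.items() if len(p) > 1}
```

Calling find_duplicates on a directory tree returns only the digests that appear more than once, which is the set of files where a digest alone cannot tell you where a recovered file belongs.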

But of course there are duplicated files. PHP is a major offender on this: not only does it install a copy of the config.guess and config.sub files, it also has some duplicated libpcre header files. But the absolute winner of the “let’s bloat a system” contest is vmware-server, as it comes with a copy of Perl itself, and some of the files are identical down to the MD5SUM!

In addition to this, my script showed that there are packages installing stuff in /usr/lib when they shouldn’t. The multilib-strict warning usually helps to find these packages, but in the case of xc, for instance, there are no arch-dependent files, so multilib-strict does not trigger (obviously). It is not really a problem, as arch-independent files are fine in /usr/lib, but as far as I can see, those files should instead go to /usr/share/xc.

* scribbles something on his TODO list about this *

I’d like some luck with those drives

So it seems like one of my HDDs is actually faulty. The SMART long test froze my box solid; not even the cursor blinked anymore.

I now have to test SDA too, to make sure it’s not simply a generic SMART problem; so far I have only tested SDB (SDA had a run of badblocks going just fine). If I can pin down the problem to a single hard disk, I can probably just buy a new Seagate Barracuda 7200.10 at €70; if I can’t pin it down safely, I’ll have to buy a new pair, and that would cost me up to €236. For a pair of 500GB disks that would be a nice investment, as in the future I’d like to buy a KVM-capable box and then run quite a few virtual machines (hopefully KVM’s I/O performance is better than qemu’s was last time I tried; besides, I should try to use an LVM device as virtual disk rather than an actual file, which should be better).

I’ve now resumed the test, and it still gives me a timeout error on ata2 (the SATA controller). I hope sda is clear.

The definitive answer should be given out by the “SeaTools” software provided by Seagate itself, but… although it is nice that they use FreeDOS to run it, it seems like my Promise SATA controller (an on-board TX2plus) is not supported, as it cannot find any disk to analyse :(

Seagate, if you’re listening, you might want to put some effort into FreeDOS development so that controllers like Promise’s TX2plus are supported (it is quite common as an on-board controller, and probably uses the same interface as other Promise controllers, or at least one near enough that TX2plus and TX4 share the same Linux driver).

Anyway, it’s not like I can do much more for tonight besides waiting for SMART reports. For now my Gentoo development is suspended: I updated my .away status, and I’ll handle packages through proxy until I can get a replacement disk. Donations to support the maintenance costs are welcome.

At least now I know I wasn’t too paranoid when I decided to always keep Enterprise’s disks in pairs.

Enterprise KO

Tonight, during an emerge -e system (to complete a GCC 4.2 transition), Enterprise’s hard disk started failing on me. /usr got unmounted while merging xcb-util back into the live fs, and quite some fiddling didn’t bring anything useful.

After running xfs_repair from SystemRescueCD, I ended up without /usr/lib64. The files are there, in lost+found, but the directory hierarchy is long gone.

This means that Enterprise, my main box, is now offline. Luckily /home is safe under a software RAID1 that should cover me even if one of the disks decides to give up. And in /home I keep basically everything but part of the PAM documentation (which anyway is in /var so it’s also fine after tonight’s failure).

Thanks to Javier (Paya) I now have a plan to restore the data tomorrow: I’ll check for the MD5 of the files in the Portage database, and then put every file where it belongs. The problem with this is that I first have to make sure that the disks aren’t dead at the hardware level, and I’m not sure how to do that.
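The restore plan itself can be sketched in Python. This is only a sketch under stated assumptions: it assumes the Portage CONTENTS files (under /var/db/pkg) list regular files as lines of the form "obj <path> <md5> <mtime>", and the index_contents and plan_restore names are invented for this illustration. It only computes a plan and does not move anything.

```python
from collections import defaultdict


def index_contents(lines):
    """Build an MD5 -> [installed paths] index from CONTENTS-style lines."""
    index = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        # Assumption: only "obj" entries carry a digest; dir and sym do not.
        if not line.startswith("obj "):
            continue
        # Split from the right so that paths containing spaces survive.
        path, digest, _mtime = line[4:].rsplit(" ", 2)
        index[digest].append(path)
    return index


def plan_restore(recovered, index):
    """Match recovered files to their original paths by digest.

    `recovered` maps a lost+found path to that file's MD5 hex digest.
    Returns (moves, ambiguous): unique matches, and digests with several
    candidate destinations that need a human decision.
    """
    moves, ambiguous = {}, {}
    for lost_path, digest in recovered.items():
        targets = index.get(digest, [])
        if len(targets) == 1:
            moves[lost_path] = targets[0]
        elif targets:
            ambiguous[lost_path] = list(targets)
    return moves, ambiguous
```

Files whose digest is not in the index at all are simply left out of both results, since the package database cannot say where they came from.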

Suggestions on how to check the disks are welcome.

And as Joshua said.. what is this with my luck? Farragut’s disk died last week, this week it’s Enterprise’s, even if I run them with a LOT of fans to keep them cool. I suppose I should start NOT working during summer, and keep the computers offline.

Anyway, the bottom line is that you can’t expect stuff coming from me until I can do a surface check of the two disks and recover my data. And if one of the disks is faulty, I’m afraid you’ll have quite some time to wait till I get back: for a series of reasons, I don’t want to run my main box counting on just one disk (and tonight has been a good demonstration of that). And for what it’s worth, I’m in a pinch even with Farragut, as I now don’t have a way to back up the content on another box.

This is what I hate about computers: maintenance costs.

Unexpected downtime

Sorry to those of my readers who tried to reach my blog in the past few days; unfortunately there were a few different reasons why I had to keep it offline, only barely related to my one-week break.

The reason why it went down is that on Wednesday at 13:40 I started smelling a bad acid smell in my office (where all my boxes are, including Farragut). I thought it was caused by some equipment in there, some messed up capacitor, or maybe one of the two UPSes being faulty, but it was a spray detergent I had in the room, which was under too much pressure for the high temperature that we reached in the last few days (37C is too high).

Then on Thursday I was entirely offline, thanks to my ISP, who had quite big problems on their routers; my neighbours, who have the exact same provider and contract, still had a connection, and my 3G phone was ready to let me contact someone (lavish) to set me officially away for a while.

Yesterday I then decided to turn Farragut back on, but it refused to: the disk was too messed up to run fsck on it, and thus it couldn’t mount it correctly. So I decided to take the SATA controller out of Klothos and put it with a spare 160G drive in Farragut, and then import the data. In the process, I also saw some brownish areas on the network card (D-Link Realtek, also smelling funny), and while Klothos was open there I decided to also take out the Intel network card that was in there. Unfortunately, /home was unreadable.

Today I tried recovering /home from the harddisk by mounting it in a USB external enclosure, but the disk refused to get itself copied through dd, and I couldn’t copy the flags by mounting the partition on Linux.. although actually there weren’t many flags in there anyway. Anyway, after a couple of tries, I couldn’t even mount the partition anymore, so.

Luckily the blog’s database was in /var, as were the GIT repositories, and I maintain a copy of the website’s SVN repository on Enterprise, so the only things actually lost were some personal settings, my zsh configuration files (which were just a copy of the ones I have on Enterprise and Parkesh), my .emacs (almost a copy), and the configured checkouts of typo and gitarella. The first was a bit tricky, mostly because last time I tried upgrading to a newer version of Typo it crapped out on me, so I would have had to look for the correct revision to use; instead I was able to get the Typo 4.1 branch up and running, even if I had to edit a couple of things in the middle of it because of a nasty bug in either rails or typo itself (for a series of reasons, I usually end up NOT using the SVN version of rails, but rather the Gentoo ebuilds for it).

Anyway, the blog is back, and I’ll later try to explain how my break week is going (spoiler: it hasn’t been a break, and it is going bad).

Dumped

Of course, I’m not talking of my personal life, as I cannot dump anyone, and I cannot get dumped, at the very least I can be rejected, but there’s no need for that either, as I’m just sure I will die alone.

What I dumped today were two packages: one being Kerry (sorry, I’m not using Beagle anymore here: too much CPU time wasted, and upstream ignored my enquiries, even though I am a fellow KDE developer and distribution maintainer), and the other confcache, which brought me only lots of problems and quite a few bad words from developers and users when the caching broke stuff.

This is part of my plan to try re-acquiring some time, possibly before I get a new job that will take most of my time, so that the packages I maintain can be maintained correctly in the long run.

I’ve also moved the metadata of net-im/kopete so that it’s under KDE herd entirely, and not directly under me. I’ll probably do similar things for other packages, too.

To continue on this plan, I’ll also write a maintainer’s guide for PulseAudio and related packages, so that other people in sound herd can handle it, while I’m not around. And try to draft up some documentation for kde.eclass and autotools.eclass and pam.eclass.

I’m not trying to reduce my commitment to Gentoo, but I’d rather try to focus it where needed: there’s no need now for an exclusive maintenance of net-im/kopete on my side, as it’s now more mature than before, and the next release of KDE 3.5 will contain it already, which means we’ll just re-absorb it into kde-base/kopete and be done with that. As for PulseAudio, I don’t feel like I need a special maintainership now that it’s mostly fine with Gentoo’s setup. Most of the stuff I maintain directly right now is there because it required special attention, like xine-lib and vlc; but while the first I want to continue maintaining myself, as I’ve come to know how it works and its quirks, I’d rather give VLC maintenance to someone else, as I’ve mostly stopped using it, and the maintainer’s guide already shows most of the relevant documentation.

So in the end my involvement would be limited to putting the package in shape, leaving the rest of the work to the herd (myself included, which means I don’t leave them uncovered). Like they say.. give a man a fish and he’ll eat for the day; teach him how to fish, and he’ll eat for his entire life.