On the time taken to stable stuff

In my previous post about the possibility of me leaving and why, a few people commented on the “staleness” of the stable (and, to some extent, unstable) trees in Gentoo. Now, I won’t argue that there are no problems; I actually said so myself a few months ago. But I’d like to clarify a few points related to the process of marking packages as stable.

First of all, we have to distinguish between two different types of staleness: single-package staleness and systemic staleness. The latter is what we had with Perl, and it’s much more complex than most users think. It wasn’t just a matter of making Perl itself work; it also involved making sure that the packages using Perl worked properly. This is also easier said than done: while perl-cleaner can take care of re-installing the packages that link to or extend Perl, it does not take care of scripts written in Perl. Even looking at the reverse dependencies doesn’t suffice, as we’d still be missing the packages that rely on the system set without declaring a dependency (and guess what? Perl is in the system set!).
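To make the scripts problem concrete: a script written in Perl is plain text, so it neither links to nor extends Perl, and often the only hint left is its `#!` line. The following is a minimal Python sketch, not part of perl-cleaner or any Gentoo tool, that lists files under a directory whose shebang invokes Perl. The function names are hypothetical, and mapping each hit back to the package that owns it (for instance with portage-utils’ `qfile`) is left out.

```python
import os


def shebang_interpreter(path):
    """Return the basename of the interpreter named on a file's '#!' line, or None."""
    try:
        with open(path, "rb") as f:
            first = f.readline(256)  # only the first line matters; cap it for binaries
    except OSError:
        return None
    if not first.startswith(b"#!"):
        return None
    parts = first[2:].decode("utf-8", "replace").split()
    if not parts:
        return None
    # "#!/usr/bin/env perl" names the real interpreter in the second word.
    if os.path.basename(parts[0]) == "env" and len(parts) > 1:
        return os.path.basename(parts[1])
    return os.path.basename(parts[0])


def find_perl_scripts(root):
    """Walk `root` and return the sorted paths of regular files whose shebang runs Perl."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so the same script is not reported twice.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            interp = shebang_interpreter(path)
            # Matches "perl" as well as versioned names such as "perl5.10.1".
            if interp is not None and interp.startswith("perl"):
                hits.append(path)
    return sorted(hits)
```

Pointed at a directory such as /usr/bin, a scan like this would also catch scripts from packages that never declare a dependency on Perl because it is in the system set. It still misses scripts run as `perl script.pl` by other programs, or shebangs such as `#!/usr/bin/env -S perl`, which is part of why this kind of audit is never complete.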

And even if we were able to track down all the packages in the tree using Perl, directly or indirectly, and I could get all of them through the Tinderbox (I did), we wouldn’t be too sure about the result’s absolute solidity, because half those scripts don’t have testsuites (see also this post by Ryan on the subject of tests). On the whole, though, we caught a few important failures, and we could assess that the tree was mostly ready, and so, finally, Perl 5.10 entered the tree, yay! I don’t think the road to stable is going to take much more time, by the way; I’d be tempted to try using it soonish for xine’s Bugzilla (at least on a mirror on this server first, of course).

A similar issue happens with Ruby 1.9. We’re taking our time to unleash it even in unstable; right now it’s tightly masked. Why? We’ve been struggling with lots of trouble, which eventually came down to the Ruby-NG eclasses. These are a very important piece of the puzzle for us, as they allow us to properly support multiple Ruby implementations without breaking the dependency tree or the general solidity of the system. They’re not perfect yet and need polishing, but, for instance, a couple of days ago I committed changes to the eclasses that ensure that scripts installed for a single implementation won’t break when a different one is selected (this partly covers the problem described here, but not entirely; covering it properly is going to take a bit more time, I’m afraid, as we’re going to revisit the whole idea of a selected Ruby implementation). And if you think that Ruby 1.9 is ready for prime time right now, there’s a reality check waiting for you. Yes, I rant about Ruby, but I think I also take a positive stance, as I have patches (and a lot of those are merged upstream and even released).

You might have noticed my use of the first-person plural (“we”). I’m most likely going to take a break from active Gentoo work, and try to limit it for a while to the areas I’m interested in, trying not to feel too pressured about it. For instance, I unsubscribed from the Gemcutter feed that tells me when Ruby packages are released, so I don’t feel the urge to bump them. On the other hand, I don’t think it’s feasible for me to leave Gentoo, at least as far as Ruby and the other things I work on are concerned. If worse comes to worst, I’ll get a frontend machine and use Fedora or something on that, with Gentoo as the backend server. Obviously, if nothing can be fixed regarding the issue I brought up, I’m not going to stick around for long, but I hope I got enough people thinking about the problem that it can be solved, in Utopia at least.

So I have just shown you two systemic staleness problems in Gentoo (one that is partly solved, and one that is actually caused by an external lack of stability that we are trying to resolve at the root). What about single-package staleness? Well, there are many examples of that, and the causes vary widely. People forget to ask for stuff to be marked stable; developers might not think anybody needs that stuff stable anyway; packages might require specific hardware to be marked stable, with no developer owning such hardware to do it (think about the EntropyKey software that I maintain(ed) in the tree: you cannot say whether it works or not without having a hardware key yourself; I don’t know of any other Gentoo developer having one, so what would happen if I left?); or they might have complicated testing procedures that are difficult to reproduce.

On these matters, the number of people working on the stabling process is not the limiting factor; throwing more people at the problem is not going to solve it any sooner (by the way, on this last point I’ll most likely be posting something in the next few days, again to explain why I reached the point of snapping). Not unless you throw the right people at the right problem. The problem here is not really the stabling part, which might actually take very little time; the problem is that we have to document things, such as the testing procedures. Sometimes we have thorough testsuites; most of the time we don’t (and in the case of Ruby, even when we do, they can be… tricky). I tried something some time ago, but it didn’t turn out as I was hoping; in the end I stopped working on it, because my half-Easter-egg, half-free-culture community collaboration (alliterations…) crashed down in flames when the source I wanted to use, Jamendo, couldn’t get its own facts straight.

I don’t want this post to go too deeply into the technical problems of testing, as they are better discussed separately, and most people interested in the topic I’m writing about might not be interested in the technical details. Let’s just say that I have seen a huge improvement in tests in the past few months. Further kudos go to two teams that I know are documenting post-build testing procedures to tell arch teams what to look at when testing their packages: the Java and Emacs teams.

Now comes what might disappoint a few users: those who think, and assert, that the solution to staleness is recklessly committing half-broken ebuilds, like the Samba ones. I’m going to argue that the opposite is true (and I’m again borrowing a line from NewsQuiz… I might have been listening to that program too much; the style of yesterday’s post was probably more deeply influenced by the newly restarted Real Time with Bill Maher, but I digress).

First of all, we have to agree on one point: staying far behind upstream sucks. It sucks for users, and it sucks for upstream as well. As Joost from Sabayon said to me earlier today (I’m following The Other Diego’s philosophy that today starts when I wake up and ends when I fall asleep), upstream will be bothered if users don’t test their recent versions at all and instead stick to old, known-broken, already-fixed versions. Having been (and still being) on both sides of the fence, upstream and downstream, I can tell you that the best feeling is when you can actually have distributions always using your latest, greatest code. This is, though, not always that simple, or even feasible at all, because of upstream’s own actions, but again, that is a topic for a different day.

Back to our reckless commits we go. Let’s take the example of Samba, since that’s what a commenter named, and since it is, I think, what best shows the trouble with “Developer B”. One of his justifications is that the current stable Samba is vulnerable; I’m afraid to tell you all that it might well be true. I use a conditional here because I didn’t have the time, nor the will, to track down whether it’s actually true or just speculation; it is true, though, that our Security team, also understaffed, hasn’t had time to deal with all the lower-level security issues for a few weeks, and I’m pretty sure they’ll catch up soon. Now, if the problem were security, we should be striving to get the new ebuilds stabled soon, shouldn’t we? And to do that, you should be actively working to reduce the number of bugs in those ebuilds.

Neither seems to be happening: the stable tracking bug actually reports that x86 is waiting, and last I checked with them, they were tempted to stick with 3.3. This is quite understandable, as 3.4 is now fully split, but unpolished and without any plan for migrating from monolithic to split. When X.Org went through its split, Donnie planned months ahead; granted, that involved over a hundred packages, maybe even a few hundred, and this is a much smaller scale, but the very fact that “Developer B”, when I asked him about a migration plan, replied that it was too boring to do should set the mood straight on the issue. Not that it’s going to matter anyway: 3.5, and maybe even 3.4, is going to be monolithic again. Yes, you’re going to get blockers, removals and so on again in unstable, oh joy!

And these bugs are assigned to a team that does not include our mysterious “Developer B”, as he didn’t add himself to the alias, as I said before. Is he CCed on any of them? Nope. Okay, this might be QA’s (my!) fault, as I should have noticed earlier that he wasn’t on the alias and either reported that to devrel or added him forcefully. Now, as most of these issues are important, you’d expect him to be working on finishing this task, rather than going off and, oh, bumping another subsystem that he doesn’t even use. But he won’t care; why? Because he has admitted many times that he does not even use Samba! Nor Mono! Again, try to wrap your mind around this concept: how can he improve the situation for their users, not being one himself? Not feeling the pain, nor sharing the gain?

Any kind of reckless bump or non-trivial change in a subsystem will require a long time to deal with, and the more you stray from the upstream-sanctioned behaviour, the more you’re going to suffer when it’s time to follow their lead. When you have to make big changes, you compromise. One such compromise in Ruby land has been trying to get the latest non-ported ebuild stabled, if an older version was already stable, before moving fully toward Ruby-NG. It’s going to have some growing pains, and yes, for now you have to use fully unstable (for the Ruby ebuilds, not for the whole tree, of course!) for it to work, but we are usually quite conservative in making sure that it works as intended.

When you make big changes and you neither plan nor compromise on how to deal with them in the long run, you’re just going to suffer, or you might end up with an even staler stable tree than you started with. On the other hand, it might be much, much worse if the stable tree gets badly broken because packages that weren’t planned ahead are moved there to remove the staleness. And by the way, this does not mean that I’m not saddened by the fact that, to use Gentoo properly on things like vserver, Xen or LXC guests, you’re basically forced to use some unstable packages, as OpenRC is the only option that works there, and Baselayout 1 is definitely rotting in the tree. Unfortunately, I’m also quite sure that there are packages that have not been fixed for that yet.

Anyway, to cut this post short so I can get some sleep and do more useful work tomorrow, I’d like to point to the concept of marginal cost, which I was introduced to by a splendid book by Richard Dawkins on evolution. The marginal cost of stabling something depends on many factors. One of them is the amount of change since the previous revision (which is why major version bumps, or complete overhauls of how a system is packaged, make stabling harder); another is regressions from the previous version (dropping patches that no longer apply, even though the issues they fixed are still not fixed upstream, increases the marginal cost tenfold). The ideal setup is to always have a very low marginal cost for stabling, and that means not changing the ebuilds in any drastic way unless strictly needed.

But if we take the example of the recurrent laryngeal nerve, which Dawkins uses in his book as evidence of evolution, we can easily see that we’re not in a biological evolution scenario: we can make drastic changes when needed to fix a situation that is blatantly out of place. In such cases, though, we’re going to increase our marginal cost for stabling… and have to accept a longer delay before things go stable. That brings us to various possible ways to tackle the problem, which are too technical for most of the people reading this in the first place, and which I’ll discuss in the coming days instead.

6 thoughts on “On the time taken to stable stuff”

  1. How about a “How to help getting your new program version into Gentoo quickly – for developers”? That wouldn’t solve the internal problems, but could help make work easier in the future.


  2. Arne, IMHO the developers should know without being told. If a dev gets commit rights, then the person giving the commit rights makes sure the new dev knows what he’s doing. Nonetheless, AFAIK Diego is actually working on changes to the dev manual, since it is terribly outdated. Apart from documentation, mechanisms to ensure/enforce the documented rules also need to be in place, and this is where Gentoo has been having problems. Take the portage tree: still running on cvs …


  3. The main issue that makes Gentoo testing difficult is the plethora of packaging options: split ebuilds, USE flags, arch, masks, etc… I think it’s time that the Gentoo developers sit down and determine better ways to simplify the process. It’s just all too much for the human brain to take in, and it is getting worse. Too many features == almost impossible to test


  4. I would like to ask: if you eventually decide to drop your support for Gentoo, would you do the same for another distro, like Arch? (What’s your opinion on it?)


  5. Actually, Gentoo is not stale. Check out this study: http://oswatershed.org/ Gentoo tracks evenly with Ubuntu, overall, at both the stable and testing level. The only major distro that’s more bleeding-edge is Fedora.

