Gentoo

Senior Engineering: Open The Door, Move Away

As part of my change of bubble this year, I officially gained the title of “Senior” Engineer. That made me take the whole “seniority” aspect of the job more seriously than I did before. Not because I’m aiming to run up the ladder of seniority, but because I feel it’s part of the due diligence of my job.

I have had very good examples in front of me for most of my career — and a few not so great ones, if I am to be honest. And so I’ve been trying to formulate my own take on what a senior engineer is based on these. You may have noticed me talking about adjacent topics in my “work philosophy” tag. I have also been comparing this in my head with my contributions to Free Software, and in particular to Gentoo Linux.

I retired from Gentoo Linux a few years ago, but realistically, I stopped being actively involved in 2013, after joining the previous bubble. Part of it was a problem with contributing, part of it was a lack of time, and part of it was the feeling that something was off. I’m starting to feel I have a better idea now of what it was, and it relates to the seniority I’m reflecting on.

You see, I worked on Gentoo Linux a little longer than I worked at the previous bubble, and as such I could say that I became a “senior developer” by tenure, but I didn’t really gain the insight to become a “senior developer” in deeds. This haunts me, because I feel it was a wasted opportunity, despite the fact that it taught me many of the things I needed to be even barely successful in my current job.

It Begins Early

My best guess is that part of the problem is that I started working on Gentoo Linux when I was fairly young and with pretty much no social experience. Combined with the less-than-perfect work environment of the project, that had me develop a number of bad habits that took a very long time to grow out of. That is not to say that age by itself is a significant factor in this — I still resent the remark from one of the other developers that not having kids would make me a worse lead. But I do think that if I hadn’t grown up staying by myself in my own world, maybe I would have been able to do a better job.

I know people my age and younger who became very effective leaders years ago — they’ve got the charisma and the energy to get people on board, and to have them all work towards a common goal in their own way. I don’t feel like I ever managed that, and I think it’s because for the longest time, the only person I had to convince to do something was… myself.

I grew up quite lonely — in elementary school, while I can say I did have one friend, I didn’t really join the other kids. It’s a bit of a stereotype for the lonely geek, but I was made fun of from early on for my passion for computers and for my dislike of soccer – I feel a psychiatrist would have a field day figuring out that one and the relationship with my father – and I failed at going to church and Sunday school, which was the only out-of-school mingling for most of the folks around.

Nearly thirty years later I can tell you that the individualism I got out of this, while it gave me a few headstarts in life when it comes to technical knowledge, held me back long-term on the people skills needed to herd the cats and multiply my impact. It’s not by chance that I wrote about teamwork and, without using the word, individualism.

Aside: I’m Jealous of Kids These Days

As an unrelated aside, this may be the reason why I don’t have such a negative view of social networks in general. It’s actually something I was asked about when I switched jobs — what my impression of the current situation is… and my point rolls back to that: when I was growing up we didn’t have social networks, Internet access was a luxury, and while, I guess, BBSes were already a thing, they would still have been too expensive for me to access. So it took until I managed to get an Internet connection for me to discover Usenet.

I know there’s a long list of issues with all kinds of social networks: privacy, polarisation, fake news, … But at the same time I’m glad that they make it much easier for kids nowadays who don’t fit in with the crowd in their geographical proximity to reach out to friendlier bunches. Of course it’s a double-edged sword, as it also allows bullies to bully more effectively… but I think that’s much more of a society-at-large problem.

The Environment Matters

Whether we’re talking about FLOSS projects or different teams at work, the environment around an individual matters. That’s because the people around them will provide influence, both positive and negative. In my case, with hindsight, I feel I hung around the wrong folks too long, in Gentoo Linux, and later on.

While a number of people I met on the project have exerted, again with hindsight, a good, positive influence on my way of approaching the world, I can also tell you now that there are some “go-to behaviours” that go the wrong way. In particular, while I’ve always tended to be sarcastic and an iconoclast, I can tell you that in my tenure as a Gentoo Linux developer I crossed the line from “snarky” to “nasty” a lot of times.

Having learnt to avoid that, and to keep in check how close to that line I get, I also know that it is something connected to the environment around me. In my previous bubble, I once begged my director to let me change team despite having spent less than the two years I was expected to be on it. The reason? I caught myself becoming more and more snarky, getting close to that line. It wouldn’t have served either me or the company for me to stay in that environment.

Was it a problem with the team as a whole? Maybe, or maybe I just couldn’t fit into it. Or maybe it was a single individual who soured the mood for many others. Donnie’s talk does not apply only to FLOSS projects, and The No Asshole Rule is still as relevant a book as ever in 2020. Just like in certain projects, I have seen teams in which the majority of the engineers explicitly walked away from certain areas, just to avoid having to deal with one or two people.

Another emergent behaviour related to this is the “chosen intermediate person” — a dysfunction I have seen in multiple projects and teams, where a limited subset of team members is used to “relate” to another individual, either within or outside the team. I was that individual in the first year of high school, with the chemistry teacher — we complained loudly about her being a bad teacher, but now I can say that she was probably a bigger expert in her field than most of the other chemistry teachers in the school; she was just terrible with people. Since I was just as bad, it seemed like I was the best interface with her, and when the class needed her approval to go on a field trip, I was “volunteered” to be the person going.

I’ll get back later to a few more reasons why tolerating “brilliant but difficult to work with” people in a project or team is unhealthy, but I want to make a few more points here, because this can be a contentious topic due to cultural differences. I have worked with a number of engineers in the past who would be described as assholes by some, and as merely grumpy by others.

In general, I think it’s worth giving people the benefit of the doubt at first — but make sure that they are aware of it! Holding people to standards they are not aware of, and have no way to course-correct around, is not fair and will stir up further trouble. And while some level of civility can be assumed, in my experience projects and teams that are heavily anglophone tend to assume a lot more commonality of expectations than is fair.

Stop Having Heroes

One of the widely known shorthands at the old bubble was “no heroes” — a reference to a slide deck from one of the senior engineers in my org on the importance of not relying on “heroes” looking after a service, a job, or a process. Individuals that will step in at any time of day and night to solve an issue, and demonstrate how they are indispensable for the service to run. The talk is significantly more nuanced than my summary right now, so take my words with a grain of salt of course.

While the talk is good, I have noticed a little too often that the shorthand is used just to tell people to stop doing what they think is the right thing, and to leave rakes all around the place. So I have some additional nuances of my own, starting with the fact that I find it a very bad sign when a manager uses the shorthand with their own reports — one of my managers did exactly that, and I know that it doesn’t help. Calling out the “no heroes” practice between engineers is generally fair game, and if you call it on your own contributions, that’s awesome, too! «This is the last time I’m fixing this, if nobody else prioritizes this, no heroes!»

On the other hand, when it’s my manager telling me to stop doing something and “let it break”, well… how does that help anyone? Yes, it’s in the best interest of the engineer (and possibly the company) for them not to be the hero that steps in, but why is this happening? Is the team relying on this heroism? Is the company relying on it? What’s the long-term plan to deal with that? Those are all questions that the manager should at least ask, rather than just tell the engineer to stop doing what they are doing!

I’ve been “the hero” a few times, both at work and in Gentoo Linux. It’s something I have always been ambivalent about. On one hand, it feels good to be able to go and fix stuff yourself. On the other hand, it’s exhausting to feel like the one person holding up the whole fort. So yes, I totally agree that we shouldn’t have heroes holding up the fort. But since it still happens, it can’t be left just up to the individual to remember to step back at the right moment to avoid becoming a hero.

In Gentoo Linux, I feel the reason why we ended up with so many heroes was the lack of coordination between teams, and the lack of general integration — the individualism all over again. And it reminds me of a post from a former colleague about Debian, because some of the issues (very little mandated common process, too many different ways to do the same things) are the kind of “me before team” approaches that drive me up the wall, honestly.

As for my previous bubble, the answer I’m going to give is that the performance review process as I remember it (hopefully it has changed in the meantime) should be held responsible for most of it, because of just a few words: go-to person. When looking at performance review as a checklist (which you’re told not to, but clearly a lot of people do), at least for my role, many of the levels included “being the go-to person”. Not a go-to person. Not a “subject matter expert” (which seems to be the preferred wording in my current bubble). But the go-to person.

From being the go-to person, to being the hero, to building up a cult of personality, the steps are not that many. And this is true in the workplace as well as in FLOSS projects — just think about it, and you can probably figure out a few projects that became synonymous with their maintainers or authors.

Get Out of The Way

What I feel Gentoo Linux taught me, and in particular leaving Gentoo Linux taught me, is that the correct thing for a senior engineer to do is to know when to bow out. Or move onto a different project. Or maybe it’s not Gentoo Linux that taught me that.

But in general, I still think the most important lesson is knowing how to open the door and get out of the way. And I mean it: both parts are needed. It’s not just a matter of moving on when you feel like you’ve done your part — you need to also be able to open the door (and make sure it stays open) for the others to pass through it. That means planning to get out of the way, not just disappearing.

This is something that I didn’t really do well when I left Gentoo Linux. While I eventually did get out of the way, I didn’t really fully open the door. I started, and I’m proud of that, but I think I should have done this better. The blogs documenting how the Tinderbox worked, as well as the notes I left about things like the USE-based Ruby interpreter selection, seem to have been useful for others to pick up where I left off… but not in a very seamless way.

I think I did this better when I left the previous bubble, by making sure all of the stuff I was working on had breadcrumbs for the next person to pick up. I have to say it did make me feel warm inside to receive a tweet, months after leaving, from a colleague announcing that the long-running deprecation project I’d worked on was finally completed.

It’s not an easy task. I know a number of senior engineers who can’t give up their one project — I’ve been that person before, although as I said I haven’t really considered myself a “senior” engineer before. Part of it is wanting to be able to keep the project working exactly like I want it to, and part of it is feeling attached to the project and wanting to be the person grabbing the praise for it. But I have been letting go as much as I could of these in the past few years.

Indeed, while some projects thrive under benevolent dictators for life, teams at work don’t tend to work quite as well. Those dictators become gatekeepers, and the projects can end up stagnating. Why does this happen more at work than in FLOSS? I can only venture a guess: FLOSS is a matter of personal pride — and you can “show off” having worked on someone else’s project at any time, even though it might be more interesting to “fully make the project one’s own”. On the other hand, if you’re working at a big company, you may optimise for working on projects where you can “own the impact” by the time you bring them up at performance review.

The Loadbearing Engineer

When senior engineers don’t move away after opening the door, they may become “loadbearing” — they may end up being the only person who knows how something works. Maybe not willingly, but someone will go “I don’t know, ask $them” whenever a question about a specific system comes by.

There’s also the risk that they may want to become loadbearing, to become irreplaceable, to build up job security. They may decide not to document the way a certain process runs, the reason why certain decisions were made, or the requirements of certain interfaces. If you happen to want to do something without involving them, they’ll be waiting for you to fail, or maybe they’ll manage to stop you from breaking an important assumption in the system at the last moment. This is clearly unhealthy for the company or project, and risky for the person involved, if they are found to not be quite as indispensable.

There’s plenty already written on the topic of bus factor, which is what this fits into. My personal take is to make sure that those who become “loadbearing engineers” take at least one long vacation a year. Make sure that they are unreachable unless something goes very wrong — as in, business-destroying wrong. And make sure that they don’t just mark themselves out of office while staying glued to their work phone and computer. And yes, I’m talking about what I did to myself a couple of times over my tenure at the previous bubble.

That is, more or less, what I did by leaving Gentoo as well — I had been holding the QA fort so long that it was a given that no matter what was wrong, Flameeyes was there to save the day. But no, eventually I wasn’t, and someone else had to go and build a better, scalable alternative.

Some of This Applies to Projects, Too

I don’t mean it as “some of the issues with engineers apply to developers”. That’s a given. I mean that some of the problems happen to apply to the projects themselves.

Projects can become the de-facto sole choice for something and leave every improvement behind, because nobody can approach them. But if something happens and they stop being updated, it might just give the ecosystem enough of a push that they get replaced. This has happened to many FLOSS projects in the past, and it’s usually a symptom of a mostly healthy ecosystem.

We have seen how XFree86 becoming stale led to Xorg being fired up, which in turn brought us a significant number of improvements, from the splitting apart of the big monolith, to XCB, to compositors, to Wayland. Apache OpenOffice has been pretty much untouched for a long time, but that gave us LibreOffice. GCC having refused plugins for long enough put more wood behind Clang.

I know that not everybody would agree that the hardest problems in software engineering are people problems, but I honestly have that feeling at this point.

sARTSurday Sci-Fi Books

Not all art is visual — and so after a number of terrific visual artists, let me bring you some written-word artistry. And because I’m trying to just point people at art rather than provide full book reviews, I’m going to point at a few different authors and different kinds of content.

First of all, my Gentoo-focused readers may remember Tobias Klausmann as a Gentoo developer — some of my ex-colleagues might remember Tobias as a colleague as well. The Slingshot Trilogy is an awesome science fiction trilogy that takes place in a distant future where technology has progressed, but human interaction… pretty much stayed the same.

Tobias’s work was particularly enjoyable for me, not just because he’s a friend, but because it’s lightweight: while it can be dark and gritty, it also comes with a positive message that if we somehow work together, we can change things. I like that. We have plenty of terribly negative narratives out there.

Speaking of darker books, this month John Scalzi’s The Interdependency trilogy came to a conclusion with The Last Emperox. This is a bit more gritty, definitely more adult-oriented (and at times NSFW) sci-fi. It is also one of the funniest series I’ve read in a while, starting with the various names of the starships, but also the way the characters behave.

From adult to young adult: Brandon Sanderson, of Mistborn and The Stormlight Archive fame, has published over the past couple of years the first two books of a sci-fi series called Skyward. It’s much more clearly aimed at young adults, and it avoids swearing, more adult themes, and so on.

Of this whole set of sci-fi books, this is the one that I would suggest for those who are looking for reading material for teens, or who prefer something more lighthearted. I definitely enjoyed it even at my age, particularly because Sanderson is a real artist with words!

To switch gears again, Chen Qiufan’s Waste Tide is another awesome title, and it brings you not to another galaxy but, if you, like me, grew up in continental Europe, to a completely different culture. The book takes place “fifteen minutes into the future” in China.

While I’m sure that for quite a few people reading this post China is not that distant or unknown a place, I have to say that for me it was a mystery up to a few years ago. I took a shine to Chinese sci-fi after listening to Christine’s talk a few years ago, while I was supporting a whole product development team in Shanghai — it was actually very helpful to have a (small) inkling of the different culture when I landed in the city to work closely with them, and I started keeping an eye on anthologies and new books.

Chen Qiufan’s book is probably my favourite when it comes to the “expansion” of the story, and I totally recommend it, even more so in the current political situation.

Happy reading!

Boot-to-Kodi, 2019 edition

This weekend I’m oncall for work, so my girlfriend and I decided to take a few chores off our to-do lists. One of the things for me was to run the now-episodic maintenance over the software and firmware of the devices we own at home. I call it episodic because I no longer spend every evening looking after servers, whether at home or remote, but rather look at them when I need to.

In this case, I honestly forgot when it was that I last ran updates on the HTPC I use for Kodi and for the UniFi controller software. And that meant that after the full update I hit the now not-uncommon situation of Kodi refusing to start at boot — or even when SSH’ing into the machine and starting the service by hand.

The error message, for ease of Googling, is:

[  2092.606] (EE) 
Fatal server error:
[  2092.606] (EE) xf86OpenConsole: Cannot open virtual console 7 (Permission denied)

What happened in this case is that the method I had been using to boot to Kodi was a systemd unit lifted from Arch Linux, which started a new session, X11, and Kodi all at once. This has now stopped working, because Xorg can no longer access the TTY: systemd does not think the unit should have access to the console.
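
For reference, the unit in question looked roughly like the following — this is a reconstruction modelled on Arch Linux’s kodi-standalone unit rather than my exact file, so the paths and options here are assumptions:

# approximate reconstruction of the old unit, e.g. /etc/systemd/system/kodi.service
[Unit]
Description=Kodi standalone (X11)
After=systemd-user-sessions.service network.target sound.target

[Service]
User=xbmc
Group=xbmc
PAMName=login
TTYPath=/dev/tty7
ExecStart=/usr/bin/xinit /usr/bin/kodi-standalone -- :0 -nolisten tcp vt7
Restart=on-abort

[Install]
WantedBy=multi-user.target

It is exactly that TTYPath/vt7 part that no longer gets the permissions it used to, which is roughly where the virtual console error above comes from.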

There supposedly are ways to convince systemd that it should let the user run X11 without so much fluff, but after an hour trying a number of different combinations I was not getting anywhere. I finally found one way to do it, and that’s what I’m documenting here: use lightdm.

I have found a number of different blog posts out there that try to describe how to do this, but none of them appear to apply directly to Gentoo.

These are the packages that would be merged, in order: 
 
Calculating dependencies... done! 
[ebuild   R    ] x11-misc/lightdm-1.26.0-r1::gentoo  USE="introspection -audit -gnome -gtk -qt5 -vala" 0 KiB

You don’t need Gtk, Qt or GNOME support for lightdm to work. But if you install it this way (which I’m surprised is allowed, even by Gentoo) it will fail to start! To configure what you need, you would have to manually write this to /etc/lightdm/lightdm.conf:

[Seat:*] 
autologin-user=xbmc 
user-session=kodi 
session-wrapper=/etc/lightdm/Xsession

In this case, my user is called xbmc (this HTPC was set up well before the rename), and this effectively turns lightdm into a bridge from systemd to Kodi. The kodi session is installed by the media-tv/kodi package, so there’s no other configuration needed. It just… worked.
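
The only thing left after that is to make sure the display manager actually starts at boot. A minimal sketch, assuming a systemd-based install like mine — the old unit name below is just a placeholder for whatever you used to start Kodi directly:

# disable whatever unit used to start Kodi directly, then hand the console over to lightdm
systemctl disable kodi.service
systemctl enable --now lightdm.service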

I know that some people would find the ability to do this kind of customization via “simple” text files empowering. For me it’s just a huge waste of time, and I’m not sure why there isn’t just an obvious way for systemd and Kodi to get along. I would hope somebody builds one in the future, but for now I guess I’ll live with this.

Distributions are becoming irrelevant: difference was our strength and our liability

For someone who has spent the past thirteen years defining himself as a developer of a Linux distribution (whether I really am still a Gentoo Linux developer or not is up for debate, I’m sure), having to write a title like this is obviously hard. But from the day I started working on open source software to now I have grown a lot, and I have realized I have been wrong about many things in the past.

One thing that I realized recently is that nowadays, distributions have lost the war. As the title of this post says, difference is our strength, but at the same time it is also the seed of our ruin. Take distributions: Gentoo, Fedora, Debian, SuSE, Archlinux, Ubuntu. They all look and act differently, focusing on different target users, and because of this they differ significantly in which software they make available, which versions are made available, and how much effort is spent on testing, both of the packages themselves and of the system integration.

Described this way, there is nothing that screams «Conflict!», except at this point we all know that they do conflict, and the solution from many different communities has been to just ignore distributions: developers of libraries for high-level languages built their own packaging (Ruby Gems, PyPI, let’s not even talk about Go), business application developers started by using containers and ended up with Docker, and user application developers have now started converging onto Flatpak.

Why the conflicts? A lot of the time the answer is to be found in bickering among developers of different distributions and the «We are better than them!» attitude, which often turned into «We don’t need your help!». Sometimes this went all the way to the negative side, to the point of «Oh it’s a Gentoo [or other] developer complaining, it’s all their fault and their problem, ignore them.» And let’s not forget the enmity between forks (like Gentoo, Funtoo and Exherbo), in which both sides are trying to prove they are better than the other. A lot of conflict all over the place.

There were of course at least two main attempts to standardise parts of how a distribution works: the Linux Standard Base and FreeDesktop.org. The former is effectively a disaster, the latter is more or less accepted, but the problem lies there: in the more-or-less. Let’s look at these two separately.

The LSB was effectively a commercial effort, aimed at pleasing (effectively) only the distributors of binary packages. It didn’t really give much assurance about the environment you could build things in, and it never invited non-commercial entities to discuss the reasoning behind the standard. In an environment like open source, the fact that the LSB became an ISO standard is not a badge of honour, but rather a worry that it’s over-specified and over-complicated. Which I think most people agree it is. There is also quite an overreach in specifying the presence of binary libraries, rather than being a set of guidelines for distributions to follow.

And yes, although technically LSB is still out there, the last release I could find described on Wikipedia is from 2015, and I couldn’t even find at first search whether they certified any distribution version. Also, because of the nature of certifications, it’s impossible to certify a rolling-release distribution, and those, as it happens, are becoming much more widespread than they used to be.

I think that one of the problems of LSB, from both the adoption and the usefulness points of view, is that it focused entirely too much on providing a base platform for binary and commercial applications. Back when it was developed, it seemed like the future of Linux (particularly on the desktop) relied entirely on the ability to develop proprietary software applications that could run on it, the way they do on Windows and OS X. Since many of the distributions didn’t really aim to support this particular environment, convincing them to support LSB was clearly pointless.

FreeDesktop.org is in a much better state in this regard. They point out that whatever they write are not standards, but de-facto specifications. Because of this de-facto character, they started by effectively writing down whatever GNOME and RedHat were doing, but then grew to be significantly more cross-desktop, thanks to KDE and other communities. Because of the nature of the open source community, FD.o specifications are much more widely adopted than the “standards”.

Again, if you compare with what I said above, FD.o provides specifications that make it easier to write, rather than run, applications. It provides you with guarantees of where you should be looking for your files, which icons should be rendered, and which interfaces are exposed. Instead of trying to provide an environment where an in-house application will keep running for the next twenty years (which, admittedly, Windows has provided for a very long time), it provides you with building-block interfaces so that you can create whatever the heck you want and integrate with the rest of the desktop environments.

As it happens, Lennart and his systemd ended up standardizing distributions a lot more than LSB or FD.o ever did, if nothing else by taking over one of the biggest customization points of them all: the init system. Now, I have complained before that this probably could have been a good topic for a standard — even before systemd, and independently from it — that developers should have been following, but that’s another problem. At the end of the day, there is at least some cross-distribution way to provide init system support, and developers know that if they build their daemon in a certain way, then they can provide the init system integration themselves, rather than relying on the packagers.
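
To give an idea of what that “certain way” amounts to in practice, here is a minimal sketch of the kind of unit file a daemon author can ship upstream and have installed unchanged by every distribution — the daemon name and paths are made up for the example:

# mydaemon.service — shipped by upstream, installed as-is by any systemd-based distribution
[Unit]
Description=Example daemon
After=network.target

[Service]
ExecStart=/usr/bin/mydaemon --foreground
Restart=on-failure

[Install]
WantedBy=multi-user.target

Compare that with having to write and maintain a different init script for every distribution’s own flavour of init system.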

I feel that we should have had much more of that. When I worked on ruby-ng.eclass and fakegem.eclass, I tried getting the Debian Ruby team, who had voiced similar complaints before, to join me on a mailing list so that we could discuss a common interface between Gems developers and Linux distributions, but once again, that did not actually happen. My afterthought is that we should have had a similar discussion for CPAN, CRAN, PyPI, Cargo and so on… and that would probably have spared us the mess that is Go packaging.

The problem is not only getting the distributions to overcome their differences, both in technical direction and marketing, but that it requires sitting at a table with the people who built and use those systems, and actually figuring out what they are trying to achieve. Because in particular in the case of Gems and the other packaging systems, who you should talk with is not only your distribution’s users, but most importantly the library authors (whose main interest is shipping stuff so that people can use them) and the developers who use them (whose main interest is being able to fetch and use a library without waiting for months). The distribution users are, for most of the biggest projects, sysadmins.

This means you have a multi-faceted problem to solve, with different roles of people, and different needs for each of them. Finding a solution that does not compromise, covers 100% of the needs of all the roles involved, and requires no workflow change on anyone’s part is effectively impossible. What you should be doing is focusing on choosing the most important features for the roles critical to the environment (in the example above, the developers of the libraries, and the developers of the apps using those libraries), requiring the minimum amount of changes to their workflow (but convincing them to change the workflow where it really is needed, as long as it’s not more cumbersome than it was before for no advantage), and figuring out what can be done to satisfy or change the requirements of the “less important” roles (distribution maintainers usually being that role).

Again going back to the example of Gems: it is clear by now that most of the developers never cared about getting their libraries carried by distributions. They cared about the ability to push new releases of their code fast, seamlessly, and without having to learn about distributions at all. The consumers of these libraries don’t and should not care about how to package them for their distributions or how they even interact with them; they just want to be able to deploy their application with the library versions they tested. And setting aside their trust in distributions, sysadmins only care about sane handling of dependencies and being able to tell which version of which library is running in production, so they can upgrade it in case of a security issue. Now, the distribution maintainers can become the nexus for all these problems, and solve them once and for all… but they will have to be the ones making the biggest changes in their workflow – which is what we did with ruby-ng – otherwise they will just become irrelevant.

Indeed, Ruby Gems and Bundler, PyPI and VirtualEnv, and now Docker itself, are expressions of that: distributions themselves became a major risk and cost point, by being too different from each other and not providing an easy way to just publish one working library, and use one working library. Those roles are critical to the environment: if nobody publishes libraries, consumers have no library to use; if nobody consumes libraries, there is no point in publishing them. If nobody packages libraries, but there are ways to publish and consume them, the environment still stands.

What would I do if I could go back in time, be significantly more charismatic, and change the state of things? (And I’m saying this for future reference, because if it ever becomes relevant to my life again, I’ll do exactly that.)

  • I would try to convince people that even with divergence in technical direction, discussing and collaborating is a good thing to do. No idea is stupid, idiotic, or whatever other negative word you prefer. The whole point is that you need to make sure that even if you don’t agree on a given direction, you can agree on others — it’s not a zero-sum game!
  • Speaking of which, “overly complicated” is a valid reason to not accept one direction and take another; “we always did it this way” is not a good reason. You can keep doing it your way, but then you’ll end up like Solaris: a very stagnant project.
  • Talk with the stakeholders of the projects that are bypassing distributions, and figure out why they are doing that. Provide “standard” tooling, or at least a proposal on how to do things in such a way that the distributions are still happy, without causing undue burden.
  • Most importantly, talk. Whether it is by organizing mailing lists, IRC channels, birds-of-a-feather sessions at conferences, or whatever else. People need to talk and discuss the issues at hand in the open, in front of the people building the tooling and making the decisions.

I have not done any of that in the past. If I ever get in front of something like this, I’ll do my best to, instead. Unfortunately, this is a position that, in the current universe we’re talking about, would have required more privilege than I had before. Not only for my personal training and experience to understand what should have been done, but because it requires actually meeting with people and organizing real life summits. And while nowadays I did become a globetrotter, I could never have afforded that before.

Gentoo Miniconf 2016

Gentoo Miniconf, Prague, October 2016

As I noted when I resurrected the blog, part of the reason why I managed to come back to “active duty” within Gentoo Linux is because Robin and Amy helped me set up my laptop and my staging servers for signing commits with GnuPG remotely.

And that happened because this year I finally managed to go to the Gentoo MiniConf hosted as part of LinuxDays in Prague, Czech Republic.

The conference track was fairly minimal; Robin gave us an update on the Foundation and on what Infra is doing — I’m really looking forward to the ability to send out changes for review, instead of having to pull and push Git directly. After spending three years using code reviews with a massive repository, I have come to like it and want to see significantly more of it.

Ulrich gave us a nice presentation on the new features coming with EAPI 7, which together with Michal’s post on EAPI 6 made it significantly easier to pick up Gentoo again.

And of course, I managed to get my GnuPG key signed by some of the developers over there, so that there is proof that whoever is committing those changes is really me.

But the most important part for me has been seeing my colleagues again, and meeting the new ones. Hopefully this won’t be the last time I get to the Miniconf, although fitting this together with the rest of my work travel is not straightforward.

I’m hoping to be at 33C3 — I have a hotel reservation and flight tickets, but no ticket for the conference yet. If any of you, devs or users, is there, feel free to ping me over Twitter or something. I’ll probably be at FOSDEM next year too, although that is not a certain thing, because I might have some scheduling conflicts with ENIGMA (unless I can get Delta to give me the ticket I have in mind.)

So once again thank you to CVU and LinuxDays for hosting us, and hopefully see you all in the future!

GnuPG Agent Forwarding with OpenPGP cards

Finally, after many months (a year?) of absence, I’m officially back as a Gentoo Linux developer with proper tree access. I have not used my powers much yet, but I wanted to at least point out why it took me so long to make it possible for me to come back.

There were two main obstacles I was facing: the first was that the manifest signing key needed to be replaced for a number of reasons, and I had no easy access to the smartcard with my main key, which I’ve been using since 2010. Instead I set myself up with a separate key on a “token”: a SIM-sized OpenPGP card installed into a Gemalto fixed-card reader (IDBridge K30.) Unfortunately this key was not cross-signed (and still isn’t, but we’re fixing that.)

The other problem is that for many (though not all) packages I worked on, I would work on a remote system, one of the containers in my “testing server”, which also host(ed) the tinderbox. This means that the signing needs to happen on the remote host, even though the key cannot leave the smartcard on the local laptop. GPG agent forwarding is not very simple, but it has sort-of-recently become possible without too much intrusion.

The first thing to know is that you really want GnuPG 2.1; this is because it makes your life significantly easier, as key management is handed over to the Agent in all cases, which means there is no need for the “stubs” of the private key to be generated in the remote home. The other improvement in GnuPG 2.1 is better socket handling: on systemd it uses the /run/user path, and in general it uses a standard named socket with no way to opt out. It also allows you to define an extra socket that is allowed to issue signature requests, but not to modify the card or secret keys, which is part of the defence in depth when allowing remote access to the key.

There are instructions which should make it easier to set up, but they don’t quite work the way I read them, in particular because they require a separate wrapper to set up the connection. Instead, together with Robin we managed to figure out how to make this work correctly with GnuPG 2.0. Of course, since that Sunday, GnuPG 2.1 was made stable, and so it stopped working, too.

So, without further ado, let’s see what is needed to get this to work correctly. In the following example we assume we have two hosts, “local” and “remote”; we’ll have to change ~/.gnupg/gpg-agent.conf and ~/.ssh/config on “local”, and /etc/ssh/sshd_config on “remote”.

The first step is to ask GPG Agent to listen on an “extra socket”, which is the restricted socket that we want to forward. We also want it to keep the display information in memory; I’ll get to explaining that towards the end.

# local:~/.gnupg/gpg-agent.conf

keep-display
extra-socket ~/.gnupg/S.gpg-agent.remote

This is particularly important for systemd users because the normal sockets would be in /run and so it’s a bit more complicated to forward them correctly.

Secondly, we need to ask OpenSSH to forward this Unix socket to the remote host; for this to work you need at least OpenSSH 6.7, but since that’s now quite old, we can be mostly safe in assuming you are using at least that. Unlike GnuPG, SSH does not correctly expand the tilde for home, so you’ll have to use the actual, fully expanded paths to get the unix socket written at the right path.

# local:~/.ssh/config

Host remote
RemoteForward /home/remote-user/.gnupg/S.gpg-agent /home/local-user/.gnupg/S.gpg-agent.remote
ExitOnForwardFailure yes

Note that the paths need to be fully qualified and are in the order remote, local. The ExitOnForwardFailure option ensures that you don’t get a silent failure to listen to the socket and fight for an hour trying to figure out what’s going on. Yes, I had that problem. By the way, you can combine this just fine with the now not so unknown SSH tricks I spoke about nearly six years ago.

Now is the slightly trickier part. Unlike the original gpg-agent, OpenSSH will not clean up the socket when it’s closed, which means you need to make sure it gets overwritten. This is indeed the main logic behind the remote-gpg script that I linked earlier, and the reason for that is that the StreamLocalBindUnlink option, which seems like the most obvious parameter to set, does not behave like most people would expect it to.

The explanation for that is actually simple: as the name of the option says, this only works for local sockets. So if you’re using the LocalForward it works exactly as intended, but if you’re using RemoteForward (as we need in this case), the one on the client side is just going to be thoroughly ignored. Which means you need to do this instead:

# remote:/etc/ssh/sshd_config

StreamLocalBindUnlink yes

Note that this applies to all the requests. You could reduce the possibility of bugs by using the Match directive to reduce them to the single user you care about, but that’s left up to you as an exercise.
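
If you do go down that route, the shape of it would be something like the following — I have not verified this exact snippet, and the user name is a placeholder:

# remote:/etc/ssh/sshd_config

Match User remote-user
    StreamLocalBindUnlink yes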

At this point, things should just work: GnuPG 2.1 will notice there is a socket already, so it will not start up a new gpg-agent process, while everything else that is needed will still start up. And since, as I said, the stubs are not needed, there is no need to use --card-edit or --card-status (which, by the way, would not be working anyway, as they are forbidden by the extra socket.)

However, if you try at this point to sign anything, it’ll just fail because it does not know anything about the key; so before you use it, you need to fetch a copy of the public key for the key id you want to use:

gpg --recv-key ${yourkeyid}
gpg -u ${yourkeyid} --clearsign --stdin

(It will also work without -u if that’s the only key it knows about.)

So what about keep-display in local:~/.gnupg/gpg-agent.conf? One of the issues Robin and I faced was gpg failing with something about “file not found”, even though the file I was using was obviously there. A bit of fiddling later turned up these problems:

  • before GnuPG 2.1 I would start up gpg-agent with the wrapper script I wrote, so it would usually be started by one of my Konsole sessions;
  • most of the time the Konsole session with the agent would be dead by the time I went to SSH;
  • the PIN for the card has to be typed on the local machine, not remote, so the pinentry binary should always be started locally; but it would get (some of) the environment variables from the session in which gpg is running, which means the shell on “remote”;
  • using DISPLAY=:0 gpg would make it work fine as pinentry would be told to open the local display.

A bit of sniffing around the source code brought up that keep-display option, which essentially tells pinentry to ignore the session where gpg is running and only consider the DISPLAY variable when gpg-agent is started. This works for me, but it has a few drawbacks. It would not work correctly if you tried to use GnuPG out of the X11 session, and it would not work correctly if you have multiple X11 sessions (say through X11 forwarding.) I think this is fine.

There is another general drawback to this solution: if two clients connect to the same SSH server with the same user, the last one connecting is the one that actually gets to provide its gpg-agent. The other one will be silently overruled. I’m afraid there is no obvious way to fix this. The way OpenSSH itself handles this for SSH Agent forwarding is to provide a randomly-named socket in /tmp, and set an environment variable to point at it. This would not work for GnuPG anymore, because it has now standardised the socket name and removed support for passing it in environment variables.

TEXTRELs (Text Relocations) and their impact on hardening techniques

You might have seen the word TEXTREL thrown around security or hardening circles, or used in Gentoo Linux installation warnings, but one thing that is clear out there is that the documentation around this term is not very useful for understanding why it is a problem. So I’ve been asked to write something about it.

Let’s start by taking apart the terminology. TEXTREL is jargon for “text relocation”, which is once again more jargon, as “text” in this case means “the code portion of an executable file.” Indeed, in ELF files, the .text section is the one that contains all the actual machine code.

As for “relocation”, the term is related to dynamic loaders. It is the process of modifying the data loaded from the file on disk to suit its placement within memory. This might also require some explanation.

When you build code into executables, any named reference is translated into an address instead. This includes, among others, variables, functions, constants and labels — and also some unnamed references such as branch destinations on statements such as if and for.

These references fall into two main types: relative and absolute references. This is the easiest part to explain: a relative reference takes some address as a “base” and then adds or subtracts from it. Indeed, many architectures have a “base register” which is used for relative references. In the case of executable code, particularly with references to labels and branch destinations, relative references translate into relative jumps, which are relative to the current instruction pointer. An absolute reference is instead a fully qualified pointer to memory — well, at least to the address space of the running process.

While absolute addresses are kinda obvious as a concept, they are not very practical for a compiler to emit in many cases. For instance, when building shared objects, there is no way for the compiler to know which addresses to use, particularly because a single process can load multiple objects, and they need to all be loaded at different addresses. So instead of writing to the file the actual final (unknown) address, what gets written by the compiler first – and by the link editor afterwards – is a placeholder. It might sound ironic, but an absolute reference is then emitted as a relative reference based upon the loading address of the object itself.

When the loader takes an object and loads it to memory, it’ll be mapped at a given “start” address. After that, the absolute references are inspected, and the relative placeholder resolved to the final absolute address. This is the process of relocation. Different types of relocation (or displacements) exists, but they are not the topic of this post.

Relocations as described up until now can apply to both data and code, but we single out code relocations as TEXTRELs. The reason for this is to be found in mitigation (or hardening) techniques. In particular, what is called W^X, NX or PaX. The basic idea of this technique is to disallow modification to executable areas of memory, by forcing the mapped pages to either be writable or executable, but not both (W^X reads “writable xor executable”.) This has a number of drawbacks, which are most clearly visible with JIT (Just-in-Time) compilation processes, including most JavaScript engines.

But besides the JIT problem, there is also the problem of relocations happening in the code section of an executable. Since the relocations need to be written to, it is not feasible (or at least not easy) to provide exclusively writable or executable access to those pages. Well, there are theoretical ways to produce that result, but it complicates memory management significantly, so the short version is that, generally speaking, TEXTRELs and W^X techniques don’t go well together.
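
If you want to check whether a given object is affected, the TEXTREL flag ends up in the ELF dynamic section, so it’s easy to spot. A quick sketch — the library path is just a placeholder, and scanelf comes from pax-utils:

# either of these will flag text relocations in a shared object
scanelf -qT /usr/lib64/libfoo.so
readelf -d /usr/lib64/libfoo.so | grep TEXTREL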

This is further complicated by another mitigation strategy: ASLR, Address Space Layout Randomization. In particular, ASLR fully defeats prelinking as a strategy for dealing with TEXTRELs — theoretically on a system that allows TEXTREL but has the address space to map every single shared object at a fixed address, it would not be necessary to relocate at runtime. For stronger ASLR you also want to make sure that the executables themselves are mapped at different addresses, so you use PIE, Position Independent Executable, to make sure they don’t depend on a single stable loading address.

Usage of PIE was for a long while limited to a few select hardened distributions, such as Gentoo Hardened, but it’s getting more common, as ASLR is a fairly effective mitigation strategy even for binary distributions where otherwise function offsets would be known to an attacker.
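
Whether a given binary was actually built as PIE can be checked from its ELF header, since a PIE is emitted as a shared object (ET_DYN) rather than a fixed-address executable (ET_EXEC). A sketch, with the path again being a placeholder:

# "DYN" means position-independent (PIE or library), "EXEC" means a fixed load address
readelf -h /usr/bin/somebinary | grep 'Type:'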

At the same time, SELinux also implements protection against text relocation, so you no longer need to have a patched hardened kernel to provide this protection.

Similarly, Android 6 is now disallowing the generation of shared objects with text relocations, although I have no idea if binaries built to target this new SDK version gain any more protection at runtime, since it’s not really my area of expertise.

LibreSSL, OpenSSL, collisions and problems

Some time ago, on the gentoo-dev mailing list, there was an interesting thread on the state of LibreSSL in Gentoo. In particular I repeated some of my previous concerns about ABI and API compatibility, especially when trying to keep both libraries on the same system.

While I hope that the problems I pointed out are well clear to the LibreSSL developers, I thought reiterating them again clearly in a blog post would give them a wider reach and thus hope that they can be addressed. Please feel free to reshare in response to people hand waving the idea that LibreSSL can be either a drop-in, or stand-aside replacement for OpenSSL.

Last year, when I first blogged about LibreSSL, I had to write a further clarification as my post was used to imply that you could just replace the OpenSSL binaries with LibreSSL and be done with it. This is not the case and I won’t even go back there. What I’m concerned about this time is whether you can install the two in the same system, and somehow decide which one you want to use on a per-package basis.

Let’s start with the first question: why would you want to do that? Everybody at this point knows that LibreSSL was forked from the OpenSSL code and started removing code that was deemed unnecessary or even dangerous – a very positive thing, given the amount of compatibility kludges around OpenSSL! – and as such it was a subset of the same interface as its parent, thus there would seem to be no reason to want the two libraries on the same system.

But then again, LibreSSL was never meant to be considered a drop-in replacement, so they haven’t cared as much about following the evolution of OpenSSL, and just proceeded in their own direction; said direction included building a new library, libtls, that implements higher-level abstractions of the TLS protocol. This vaguely matches the way NSS (the Netscape-now-Mozilla TLS library) is designed, and I think it makes sense: it reduces the amount of repetition that needs to be coded into multiple parts of the software stack to implement HTTPS, for instance, reducing the chance of one of them making a stupid mistake.

Unfortunately, this library was originally tied firmly to LibreSSL and there was no way for it to be usable with OpenSSL — I think this has changed recently as a “portable” build of libtls should be available. Ironically, this wouldn’t have been a problem at all if it wasn’t that LibreSSL is not a superset of OpenSSL, as this is where the core of the issue lies.

By far, this is not the first time a problem like this happens in Open Source software communities: different people will want to implement the same concept in different ways. I like to describe this as software biodiversity and I find it generally a good thing. Having more people looking at the same concept from different angles can improve things substantially, especially in regard to finding safe implementations of network protocols.

But there is a problem when you apply parallel evolution to software: if you fork a project and then evolve it on your own agenda, but keep the same library names and a mostly compatible (thus conflicting) API/ABI, you’re going to make people suffer, whether they are developers, consumers, packagers or users.

LibreSSL, libav, Ghostscript, … there are plenty of examples. Since the features of the projects, their API and most definitely their ABIs are not the same, when you’re building a project on top of any of these (or their originators), you’ll end up at some point making a conscious decision on which one you want to rely on. Sometimes you can do that based only on your technical needs, but in most cases you end up with a compromise based on technical needs, licensing concerns and availability in the ecosystem.

These projects didn’t change the names of their libraries; that way they can be used as drop-rebuild replacements for consumers that keep to the greatest common divisor of the interface, but that also means you can’t easily install two of them on the same system. And since most distributions, with the exception of Gentoo, do not really provide the users with a choice of multiple implementations, you end up with either a fractured ecosystem, or one that is very much non-diverse.

So if all distributions decide to standardize on one implementation, that’s what the developers will write for. And this is why OpenSSL is likely to stay the standard for a long while still. Of course in this case it’s not as bad as the situation with libav/ffmpeg, as the base featureset is going to be more or less the same, and the APIs that have been dropped up to now, such as the entropy-gathering daemon interface, have been considered A Bad Idea™ for a while, so there are not going to be OpenSSL-only source projects in the future.

What becomes an issue here is that software is built against OpenSSL right now, and you can’t really change this easily. I’ve been told before that this is not true, because OpenBSD switched, but there is a huge difference between all of the BSDs and your usual Linux distributions: the former have much more control on what they have to support.

In particular, the whole base system is released in one go, and it generally includes all the binary packages you can possibly install. Very few third-party software providers release binary packages for OpenBSD, and not many more do for NetBSD or FreeBSD. So as long as you either use the binaries provided by those projects or those built by you on the same system, switching the provider is fairly easy.

When you have to support third-party binaries, then you have a big problem, because a given binary may be built against one provider, but depend on a library that depends on the other. So unless you have full control of your system, with no binary packages at all, you’re going to have to provide the most likely provider — which right now is OpenSSL, for good or bad.
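
A quick way of seeing which provider a given binary ends up pulling in, directly or through its dependencies, is to look at the resolved link-time closure; a sketch, with the binary path being a placeholder and the patterns assuming the usual library sonames:

# shows every libssl/libcrypto/libtls copy mapped into the process at load time
ldd /usr/bin/somebinary | grep -E 'libssl|libcrypto|libtls'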

Gentoo Linux is, once again, in a more favourable position than many others. As long as you have a full source stack, you can easily choose your provider without considering its popularity. I have built similar stacks before, and my servers deploy similar stacks, although I have not tried using LibreSSL for any of them yet. But on the desktop it might be trickier, especially if you want to do things like playing Steam games.

But here’s the harsh reality: even if you were to install the libraries in different directories, and you were to provide a USE flag to choose between the two, it is not going to be easy to apply the right constraints between final executables and libraries all the way through the tree.

I’m not sure I have an answer to balance the ability to just make the old software use the new library against side-by-side installation. I’m scared that the “solution” that will be found for this problem is bundling, and you can probably figure out that doing so for software like OpenSSL or LibreSSL is a terrible idea, given how fast you should update in response to a security vulnerability.

New devbox running

I announced in February that Excelsior, which ran the Tinderbox, was no longer at Hurricane Electric. I also said I would start working on a new-generation Tinderbox, and to do that I need a new devbox, as the only three Gentoo systems I have at home are the laptops and my HTPC — not exactly hardware to run compilation on all the freaking time.

So after thinking over the options, I decided that it was much cheaper to just rent a single dedicated server, rather than a full cabinet, and after asking around I settled on Online.net, because of the price and recommendations from friends. Unfortunately they do not support Gentoo as an operating system, which makes a few things a bit more complicated. They do provide you with a rescue system, based on Ubuntu, which is enough to do the install, but not everything is easy that way either.

Luckily, most of the configuration (but not all) was stored in Puppet — so I only had to rename the hosts there, change the MAC addresses for the LAN and WAN interfaces (I use static naming of the interfaces as lan0 and wan0, which makes many other pieces of configuration much easier to deal with), change the IP addresses, and so on. Unfortunately, since I didn’t start setting up that machine through Puppet, it also meant that it did not carry all the information needed to replicate the system, so it required some iteration and fixing of the configuration. This also means that the next move is going to be easier.

The biggest problem was setting up the MDRAID partitions correctly, because of GRUB2: if you didn't know, grub2 has an automagic dependency on mdadm; if you don't install it, grub2 won't be able to install itself on a RAID device, even though it can detect one. The maintainer refused to add a USE flag for it, so you just have to know about it.
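
In practice the workaround is just to make sure mdadm is merged before you try to install the bootloader; something along these lines (a minimal sketch, the disk devices are placeholders):

    # grub2-install needs mdadm around to cope with installing onto a RAID device
    emerge --ask sys-fs/mdadm
    # after that, installing to the disks backing the array works as expected
    grub2-install /dev/sda
    grub2-install /dev/sdb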

Given what the kernel can and cannot autodetect, I fought a little more than usual, then gave up and rebuilt the two arrays (/boot and /; yes, laugh at me, but when I installed Excelsior that was the only way to get GRUB2 not to throw up) with metadata format 0.90. The other problem was being able to tell what the boot errors actually were, since of course I have no physical access to the machine.
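
For the record, forcing the old metadata format is just a matter of passing --metadata=0.90 when creating the arrays; a sketch with placeholder devices:

    # 0.90 superblocks live at the end of the device, which is what lets the
    # kernel autodetect the arrays at boot and keeps GRUB2 happy with /boot on RAID1
    mdadm --create /dev/md0 --metadata=0.90 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
    mdadm --create /dev/md1 --metadata=0.90 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2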

The server I rented from Online.net is a Dell box that comes with iDRAC for remote management (essentially Dell's own name for IPMI), and Online.net lets you set up connections to it through your browser, which is pretty neat: they use a pool of temporary IP addresses and only authorize your own IP address to connect to them. On the other hand, they do not change the default certificates, which means you end up with the same untrustable Dell certificate every time.

From the iDRAC console you can't do much, but you can start up the remote, JavaWS-based console, which reminded me of something. Unfortunately the JNLP file you can download from iDRAC did not work on either the Sun, Oracle or IcedTea JREs, segfaulting (no kidding) with an X.509 error log as its last output. I seriously thought the problem was with the certificates, until I decided to dig deeper and found this set of entries in the JNLP file:

 <resources os="Windows" arch="x86">
   <nativelib href="https://idracip/software/avctKVMIOWin32.jar" download="eager"/>
   <nativelib href="https://idracip/software/avctVMAPI_DLLWin32.jar" download="eager"/>
 </resources>
 <resources os="Windows" arch="amd64">
   <nativelib href="https://idracip/software/avctKVMIOWin64.jar" download="eager"/>
   <nativelib href="https://idracip/software/avctVMAPI_DLLWin64.jar" download="eager"/>
 </resources>
 <resources os="Windows" arch="x86_64">
   <nativelib href="https://idracip/software/avctKVMIOWin64.jar" download="eager"/>
   <nativelib href="https://idracip/software/avctVMAPI_DLLWin64.jar" download="eager"/>
 </resources>
  <resources os="Linux" arch="x86">
    <nativelib href="https://idracip/software/avctKVMIOLinux32.jar" download="eager"/>
   <nativelib href="https://idracip/software/avctVMAPI_DLLLinux32.jar" download="eager"/>
  </resources>
  <resources os="Linux" arch="i386">
    <nativelib href="https://idracip/software/avctKVMIOLinux32.jar" download="eager"/>
   <nativelib href="https://idracip/software/avctVMAPI_DLLLinux32.jar" download="eager"/>
  </resources>
  <resources os="Linux" arch="i586">
    <nativelib href="https://idracip/software/avctKVMIOLinux32.jar" download="eager"/>
   <nativelib href="https://idracip/software/avctVMAPI_DLLLinux32.jar" download="eager"/>
  </resources>
  <resources os="Linux" arch="i686">
    <nativelib href="https://idracip/software/avctKVMIOLinux32.jar" download="eager"/>
   <nativelib href="https://idracip/software/avctVMAPI_DLLLinux32.jar" download="eager"/>
  </resources>
  <resources os="Linux" arch="amd64">
    <nativelib href="https://idracip/software/avctKVMIOLinux64.jar" download="eager"/>
   <nativelib href="https://idracip/software/avctVMAPI_DLLLinux64.jar" download="eager"/>
  </resources>
  <resources os="Linux" arch="x86_64">
    <nativelib href="https://idracip/software/avctKVMIOLinux64.jar" download="eager"/>
   <nativelib href="https://idracip/software/avctVMAPI_DLLLinux64.jar" download="eager"/>
  </resources>
  <resources os="Mac OS X" arch="x86_64">
    <nativelib href="https://idracip/software/avctKVMIOMac64.jar" download="eager"/>
   <nativelib href="https://idracip/software/avctVMAPI_DLLMac64.jar" download="eager"/>
  </resources>

Turns out that if you remove everything but the Linux/x86_64 option, it fetches the right jar and executes the right code without segfaulting. Mysteries of Java Web Start, I guess.

So after finally getting the system to boot, the next step was setting up networking. As I said, I used Puppet to set up the addresses and everything, so I had working IPv4 at boot, but I had to fight a little longer to get IPv6 working. Indeed, IPv6 configuration on servers, virtual and dedicated alike, is very much an unsolved problem; not because there is no solution, but because there are too many. Essentially every hosting provider I have ever used had a different way to set up IPv6 (including none at all in one case, where the only option was a tunnel), so it takes some fiddling around to set it up correctly.

To be honest, Online.net has a better setup than OVH or Hetzner (the latter being very flaky), and a more self-service one than Hurricane, which was very flexible, and thus very easy to set up, but required me to mail them whenever I wanted to make changes. Online.net's documentation covers dibbler, as they rely on DHCPv6 with DUID for prefix delegation: they give you a single /56 IPv6 network that you can then split into subnets and delegate independently.

What DHCPv6 in this configuration does not give you is routing, which kind of makes sense, as you can use RA (Router Advertisement) for that. Unfortunately, at first I could not get it to work. It turns out that, since I use subnets for the containerized network, I had enabled IPv6 forwarding (through Puppet, of course), and Linux will ignore Router Advertisement packets when forwarding IPv6 unless you ask it nicely to, by setting accept_ra=2 as well. Yey!
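
For reference, this is the kind of sysctl configuration involved; a minimal sketch assuming my wan0 naming and whichever file under /etc/sysctl.d/ you prefer:

    # /etc/sysctl.d/ipv6.conf
    # forwarding is needed for the containers' subnets...
    net.ipv6.conf.all.forwarding = 1
    # ...but with forwarding on, RAs are ignored unless accept_ra is bumped to 2
    net.ipv6.conf.wan0.accept_ra = 2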

Again, this is the kind of problem where finding the information took much longer than it should have: Linux does not really tell you that it's ignoring RA packets, and it is far from obvious that setting one sysctl will effectively disable another, unless you go and look for it.

Luckily this was the last problem I had; after that the server was set up fine and I just had to finish configuring the domain's zone file, the reverse DNS and the SPF records… yes, this is the kind of trouble you go through when you run your whole infrastructure yourself rather than going fully cloud, which is why I don't consider self-hosting a general solution.

What remained was just bits and pieces. The first was realizing that Puppet does not remove entries from /etc/fstab by default, which is how I noticed that the default Gentoo /etc/fstab file still contains entries for CD-ROM drives as well as /dev/fd0. I don't remember the last computer I used with a floppy disk drive, let alone owned.
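
If you want Puppet to actually clean those entries up, you have to declare them absent explicitly; a minimal sketch, with the mount points quoted from the stock fstab from memory:

    # Puppet's mount type only removes fstab entries you explicitly mark as absent
    mount { '/mnt/cdrom':
      ensure => absent,
    }
    mount { '/mnt/floppy':
      ensure => absent,
    }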

The other fun bit was setting up the containers themselves; like the server, they are managed with Puppet. Since the old server ran a tinderbox, it also hosted a proper rsync mirror, simply because that was easier, but I didn't want to repeat that here. And since I was unable to find a good mirror through mirrorselect (longer story), I configured Puppet to just point all the containers at distfiles.gentoo.org as their distfiles mirror, which did not work: it turns out that our default mirror address does not have any IPv6 hosts behind it. When I asked Robin about it, it seems we just don't have any IPv6-hosted mirror that can handle that traffic, which is sad.
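
Concretely, and assuming this goes through GENTOO_MIRRORS rather than anything fancier, the per-container setting amounts to a single line in make.conf, managed by Puppet:

    # /etc/portage/make.conf (in each container)
    # fall back to the default round-robin instead of a hand-picked mirror
    GENTOO_MIRRORS="http://distfiles.gentoo.org"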

So anyway, I now have a new devbox and I'm trying to set up the rest of my repositories and access (I have not set up access to Gentoo's repositories yet, which is kind of the point here). Hopefully this will also lead to more technical blogging in the next few weeks, as I'm cutting down on the overwork to relax a bit.

TG4: Tinderbox Generation 4

Everybody’s a critic: the first comment I received when I showed other Gentoo developers my previous post about the tinderbox was a question on whether I would be using pkgcore for the new generation tinderbox. If you have understood what my blog post was about, you probably understand why I was not happy about such a question.

I thought the blog post made it very clear that my focus right now is not to change the way the tinderbox runs but the way the reporting pipeline works. This is the same problem as in 2009: generating build logs is easy, sifting through them is not. At first I thought this was hard just for me, but the fact that GSoC attracted multiple people interested in doing continuous builds, and not one interested in logmining, showed me that this is just a hard problem.

The approach I took last time, with what I'll start calling TG3 (Tinderbox Generation 3), was to highlight the error/warning messages, provide a list of build logs for which a problem was identified (without caring much about which kind of problem), and just show broken builds or broken tests in the interface. This was easy to build, and up to a point easy to use, but it had a lot of drawbacks.
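
To give an idea of the kind of processing involved (this is just an illustration of the principle, not the actual TG3 code), flagging a log boils down to pattern matching over the build output:

    # list the build logs that trip over any of the "interesting" patterns;
    # the patterns here are only examples of the kind of markers worth flagging
    grep -E -l \
        -e '^ \* ERROR: .* failed' \
        -e 'warning: implicit declaration of function' \
        build-logs/*.log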

The major drawback of that UI is that it relies on manual work to identify the open bugs for a package (and thus make sure not to report duplicate bugs), and on my own memory not to report the same issue multiple times if the bug was closed by some child as NEEDINFO.

I don’t have my graphic tablet with me to draw a mock of what I have in mind yet, but I can throw in some of the things I’ve been thinking of:

  • Being able to tell what problem or problems a particular build is about. It’s easy to tell whether a build log is just a build failure or a test failure, but what if instead it has three or four different warning conditions? Being able to tell which ones have been found and having a single-click bug filing system would be a good start.
  • Keep in mind the bugs already filed against a package. This is important because sometimes a build log is just a repeat of something filed already; it may well have failed multiple times since you started a reporting run, so it should be easy to see that at a glance.
  • Related, it should collapse failures per package, so as not to repeat the same package multiple times on the page. Say you look at the build failures every day or two: you don't care if the same package failed 20 times, especially if the logs report the same error. Finding out whether the error messages are actually the same is tricky, but at least you can collapse the multiple logs into a single entry per package, so you don't need to skip it over and over again.
  • Again related, it should keep track of which logs have been read and which haven't. It's going to be tricky if the app is made multi-user, but at least a starting point needs to be there.
  • It should show the three most recently opened bugs for the package (and a count of how many other bugs are open), so that if the bug was filed by someone else, it does not need to be filed again. Bonus points for showing a few of the most recently closed bugs too.

You can already tell that this is a considerably more complex interface than the one I used before. I expect it'll take some work with JavaScript at the very least, so I may end up doing it with AngularJS and Go, mostly because that's what I need to learn for work as well (don't get me started). At least I don't expect to be doing it in Polymer, but I won't exclude that just yet.

Why do I spend this much time thinking and talking (and soon writing) about the UI? Because I think this is the current bottleneck to scaling up the amount of analysis of Gentoo's quality. Running a tinderbox is getting cheaper: there are plenty of dedicated server offers that are considerably cheaper than what I paid to host Excelsior, let alone the initial investment in it. And that's without looking again at the possible costs of running instances on GCE or AWS on demand.

Three years ago, the choice of a physical server in my own hands was easier to justify than it is now, with 4-core HT servers with 48GB of RAM starting at €40/month. While I/O is still the limiting factor, with that much RAM it's entirely possible to have one tinderbox building fully in tmpfs, and to just rent a separate server for a second instance, rather than sharing one machine between multiple instances.

And even though GCE/AWS instances, which are charged for the time they run, are not exactly interesting for continuous build systems, having a cloud image that can be told to start running a tinderbox over a fixed set of packages, say all the reverse dependencies of libav, would make it possible to run explicit tests on code that is known to be fragile without pausing the main tinderbox.

Finally, there are different ideas of how we should be testing packages: all options enabled, all options disabled, multilib or not, hardened or not, one package at a time, all packages together… They can all share the exact same logmining pipeline, since all it needs is the emerge --info output and the log itself, which may or may not carry markers for known issues to look out for. You can then build the packages however you like, as long as you can submit the logs.
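
To make the submission side concrete, here is a purely illustrative sketch: no such endpoint exists yet, and the URL and field names are made up; the point is that a client only needs to ship those two artifacts:

    # hypothetical submission: the pipeline only needs the build log plus the
    # emerge --info output that describes how the package was built
    curl -F 'info=@emerge-info.txt' \
         -F 'log=@build.log' \
         https://tinderbox.example.org/api/v1/logs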

Now, my idea is not to build this just for myself and then run the analysis for everybody who wants to submit build logs, because that would be just about as crazy. But I think it would be okay to have a shared instance where Gentoo developers can submit build logs from their own personal instances, if they want to, and then look only at their own accounts. It's not going to be my first target, but I'll keep it in mind when I start on the mock-ups and implementation, because I think it might prove successful.