Documentation needs review tools, not Wikis

I’m a strong believer in documentation being a fundamental feature of open source, although I’m probably bad at following my own advice. While I do write down a lot of running notes on this blog, as I said before, blogs don’t replace documentation. I have indeed complained about how hard it seems to be to publish documentation that is not tied to a particular codebase, but there’s a bit more that I want to explore.

I have already discussed code reviews in the past few months — pointing out how the bubble got me used to review tooling (back in the day this would have been called CASE). The main thing I care about with these tools is that they reduce the cost of the review, which makes it less likely that a patch is left aside for too long — say for three weeks, because one reviewer points out that the code you copied from one file to another is unsafe, and the other notes they missed it the first time around, but now it’s your problem to get it fixed.

In a similar spirit, “code reviews” for documentation are an incredibly powerful tool. Not just for documentation quality, but also for their inclusiveness. Let me explain, focusing in particular on documentation that is geared toward developers — because that’s what I know best. Product documentation, and documentation intended for end users, is something I have had barely any contact with, and I don’t think I have the experience to discuss the matter.

So let’s say you’re looking at a tool’s wiki page, follow the instructions in it, but get a completely different result than you expected. You think you know why (maybe something has changed in one of the tool’s dependencies, maybe the operating system is different, or maybe it never worked in the first place), and you want to fix the documentation. If you just edit the wiki, and you’re right, you’re saving the next person who comes over to the documentation a lot of time and grief.

But what happens if you’re wrong? Well, if you’re wrong you may be misinterpreting the instructions, and may end up giving a bad suggestion to the next person coming over. You may be making the equivalent of all the bad howto docs that say to just chmod 0777 /dev/something to make some software work — and the next person will find instructions that work, but open a huge gaping security hole into the software.

Do you edit the wiki? Are you sure there are enough senior engineers who know the tool, who would notice you edited the wiki and revert your change if it is wrong? You may know who has the answer, and decide to send them a note with the change — “Hey, can you check if I did it right?” — but what if they just went on a three-week vacation? What if they end up in the hospital after writing about LED lights?

And it’s not just a matter of how soon someone might spot a mistaken edit. There’s the stress of not knowing (or maybe knowing) how such a mistake would be addressed. Will it be a revert with “No, you dork!”, or will it be an edit that goes into further detail about what the intention was and what the correct approach should have been in the first place? Wikipedia is an example of something I don’t enjoy editing, despite doing it from time to time. I just find some of its policies absurd — they gave me a hard time while I was trying to correct an editor’s incorrect understanding of my own project, while at the same time I found a minor “commercial open source” project with what I would call something close to an advertisement piece, with all the references pointing at content written by the editor themselves — who happens to be the main person behind said project.

Review-based documentation systems – including Google’s g3doc, but also the “humble” Google Docs suggested edits! – alleviate this problem, particularly when you provide a “fast path” for fixing obvious typos without going through the full review flow. For everything else, they allow you to make your change and then send it to someone who can confirm it’s right, or start discussing what the correct approach should be — and if you happen to be the person doing the review, be the rake collector: help clear up the documentation!

Obviously, it’s not perfect — if all your senior engineers are jerks who would call a newcomer names for making a mistake in documentation, the review would be just as stressful. But it gives a significant first-mover advantage: you can (often) choose who to send the review to. And let’s be honest: most jerks are bullies, and they are less likely to call a newcomer names when the change has already got a sign-off from another senior person.

This is not where it ends, either. Even when you are a senior engineer, or very well acquainted with a certain tool, you may still want to run documentation changes through someone else because you’re not sure how they will be read. For me, this is often related to the fact that English is not my native language — I may say something in a way that is, in my head, impossible to misunderstand, and yet confuse everybody else reading it, because I’m using specialised terms, uncommon words, or I keep insisting on using a word that doesn’t mean what I think it means.

As an aside, if you read most of my past writing, you may have noticed I keep using the word sincerely when I mean honestly or truthfully. This is a false friend from Italian, where sincero means truthful. It’s one particular oddity that I was made aware of and tried very hard to get rid of, but still goes through at times. For the same reason, I tend to correct other people with the same oddity, as I trained myself to notice it.

And while non-native English speakers may think of this problem more often, that’s not to say that native English speakers don’t need to pay attention to this, or that they shouldn’t have someone else read their documentation first. In particular, when writing a tutorial it is necessary to have someone from the target audience read through it! That means someone who is not yet acquainted with the tool, because they will likely ask you questions if you start using terms that they have never heard before, but that are completely obvious to you.

Which is why I insist that having documentation in a reviewable (not necessarily requiring a review) repository, rather than a wiki, is an inclusiveness issue: it reduces the stress for newcomers, non-native English speakers, less aggressive people, and people who might not have gone to schools with debating clubs.

And at the same time, it reduces the risk that security-hole-enabling documentation is left, even for a little while, unreviewed but live. Isn’t that good?

Falsehoods in Tutorials: Database Schemas

It’s quite possible that a number of people reading this post have already stumbled across a few of the “Falsehoods Programmers Believe…” documents. If not, there appears to be a collection of them, although I have honestly only read through the ones about names, addresses, and time. The short version of all of this is that interfacing software with reality is complicated, and in many cases programmers don’t realise just how complicated it is. And sometimes this turns, effectively, into institutional xenophobia.

I have already mused that tutorials and documentation are partially to blame, by spreading code memes and reality-hostile simplifications. But now I have some more evidence of this being the case, without me building an explicit strawman like I did last time, and that brings me to another interesting point regarding the rising importance of getting stuff right beforehand, as the cost of correcting these mistakes keeps rising.

You see, with lockdown giving us a lot of spare time, I spent some of it on artsy projects and electronics, while my wife spent it learning about programming, Python, and more recently databases. She found a set of tutorials on YouTube that explain the basics of what a database is and how SQL works. And they were full of those falsehoods I just linked above.

The tutorials use what I guess is a fairly common example of using a database for employees, customers, and branches of a company. And the example includes fields for first name and last name. Which frankly is a terrible mistake — with very few exceptions (banks and airlines among them), there’s no need to distinguish between the components of a name; a simple full name field works just as well, and doesn’t end up causing headaches for people from cultures that don’t split names the same way. The fact that I recently ranted about this on Twitter against VirusTotal is not totally coincidental.
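As a minimal sketch — my own, not the tutorial’s, assuming a generic SQL dialect — storing the name as a single free-form column is as simple as:

CREATE TABLE Customer (
  ID        INTEGER PRIMARY KEY,
  Full_Name TEXT NOT NULL  -- stored exactly as the person provides it, no splitting
);

If a specific system genuinely needs separate name components (again, banks and airlines), that should be a deliberate, documented exception rather than the default every tutorial teaches.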

It goes a bit beyond that, though, by trying to explain ON DELETE triggers by attaching them to the deletion of an employee from the database. Now, I’m not a GDPR lawyer, but it’s my understanding that employee rosters are one of those things that you’re allowed to keep for essential business needs — and you most likely don’t want to ever delete employees, their commission payment history, or their tax records.

I do understand that a lot of tutorials need to use simple examples, as setting up a proper HR-compatible database would probably take a lot more time, particularly with compartmentalizing information so that your random sales analyst doesn’t have access to the home phone numbers of their colleagues.

I have no experience with designing employee-related database schemas, so I don’t really want to dig myself into a hole I can’t come out of, by running with this example. I do have experience with designing database schemas for product inventory, though, so I will run with that example. I think it was a different tutorial that was talking about those, but I’ll admit I’m not sure, because I didn’t pay too much attention as I was getting annoyed at the quality.

So this other tutorial focused on products, orders, and sales totals — its schema was naïve and not the type of database any real order history system would use. Notably, it assumed that an order would just need to reference the products, with the price attached to the product row. In truth, most databases like these need to attach the price at which an item was sold to the order — because products change prices over time.

And at the same time, it’s fairly common to want to keep the history of price changes for an item, including the ability to pre-approve time-limited discounts, so a table of products is fairly unlikely to have the price for each item as a column. Instead, I’ve commonly seen these databases have a prices table that references the items and provides start and end dates for each price. This way, it’s possible to know at any time what the “valid price” for an item is. And as some of my former customers had to learn on their own, it’s also important to track separately which VAT rate applies at which time.

Example ER diagram of a more realistic shop database.

There are five tables. * indicates the primary key.

Order (*ID, Customer_ID, Billing_Address, Shipping_Address)
Order_Products(*Order_ID, *Product_ID, Gross_Price, VAT_Rate)
Product(*ID, Name)
Product_VAT(*Product_ID, *Start_Date, End_Date, VAT_Rate)
Product_Price(*Product_ID, *Start_Date, End_Date, Gross_Price)
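For those who prefer DDL to diagrams, here is a rough sketch of the same five tables in a generic SQL dialect — the column types are my own assumption and not part of the diagram (Order is quoted because it’s a reserved word):

CREATE TABLE Product (
  ID   INTEGER PRIMARY KEY,
  Name TEXT NOT NULL
);

CREATE TABLE Product_Price (
  Product_ID  INTEGER REFERENCES Product(ID),
  Start_Date  DATE NOT NULL,
  End_Date    DATE NOT NULL,           -- explicit end date; the queries below rely on it
  Gross_Price NUMERIC(10,2) NOT NULL,
  PRIMARY KEY (Product_ID, Start_Date)
);

CREATE TABLE Product_VAT (
  Product_ID INTEGER REFERENCES Product(ID),
  Start_Date DATE NOT NULL,
  End_Date   DATE NOT NULL,
  VAT_Rate   NUMERIC(4,2) NOT NULL,
  PRIMARY KEY (Product_ID, Start_Date)
);

CREATE TABLE "Order" (
  ID               INTEGER PRIMARY KEY,
  Customer_ID      INTEGER NOT NULL,
  Billing_Address  TEXT,
  Shipping_Address TEXT
);

CREATE TABLE Order_Products (
  Order_ID    INTEGER REFERENCES "Order"(ID),
  Product_ID  INTEGER REFERENCES Product(ID),
  Gross_Price NUMERIC(10,2) NOT NULL,  -- price as sold, frozen at order time
  VAT_Rate    NUMERIC(4,2) NOT NULL,   -- VAT rate as applied to this order line
  PRIMARY KEY (Order_ID, Product_ID)
);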

This is again fairly simplified. Most of the shopping systems you might encounter use what might appear to be redundant data, particularly if you’re taught that SQL requires normalised databases — but that’s just theory; practice is different. Significantly so, at times.

Among other things, if you have an online shop that caters to multiple countries within the European Union, then your table holding products’ VAT information might need to be extended to include the country for each entry. Conversely, if you only need to account for VAT in a single country, you may be able to reduce this to VAT categories — but keep in mind that products can and do change VAT categories over time.
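To make that concrete, a sketch — my own extension, not part of the diagram above — of a per-country VAT table, replacing the single-country version, could look like:

CREATE TABLE Product_VAT (
  Product_ID INTEGER REFERENCES Product(ID),
  Country    CHAR(2) NOT NULL,           -- ISO 3166-1 code, e.g. 'IE' or 'IT'
  Start_Date DATE NOT NULL,
  End_Date   DATE NOT NULL,
  VAT_Rate   NUMERIC(4,2) NOT NULL,
  PRIMARY KEY (Product_ID, Country, Start_Date)
);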

Some people might now wonder why you would go through this much trouble for an online store that only needs to know what the price is right now. That’s a fair question, particularly if you have hundreds of megabytes of database to go through just to query the current price of a product. With the schema above you would probably need a query such as

SELECT Product.ID, Product.Name, Product_Price.Gross_Price, Product_VAT.VAT_Rate
FROM Product
  LEFT JOIN Product_Price ON Product_Price.Product_ID = Product.ID
  LEFT JOIN Product_VAT ON Product_VAT.Product_ID = Product.ID
WHERE
  Product.ID = '{whatever}' AND
  Product_Price.Start_Date <= TODAY() AND
  Product_Price.End_Date > TODAY() AND
  Product_VAT.Start_Date <= TODAY() AND
  Product_VAT.End_Date > TODAY();

It sounds like an expensive query, doesn’t it? And it seems silly to go and scan the price and VAT tables all the time throughout the same day. It also might be entirely incorrect, depending on where it’s used — I do not know the rules of billing, but it may very well be possible that an order placed close to a VAT change boundary means the customer has to pay the gross price at the time of order, but the VAT at shipping time!

So what you end up using in many places for online ordering is a different database, which is not the canonical copy. Often the term used for this is ETL, which stands for Extract, Transform, Load. It basically means you can build new, read-only tables once a day, and select out of those in the web frontend. For instance the above schema could be ETL’d to include a new, disconnected WebProduct table:

The same ER diagram as before, but this time with an additional table:

WebProduct(*ID, *Date, Name, Gross_Price, VAT_Rate)
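As a sketch of what the Load step of such a pipeline might boil down to — my own example, assuming the single-country schema from the diagram and an arbitrary date literal (the Date column may need quoting in some dialects) — one day’s worth of WebProduct rows is a single INSERT … SELECT over the canonical tables:

INSERT INTO WebProduct (ID, Date, Name, Gross_Price, VAT_Rate)
SELECT Product.ID, DATE '2020-07-01', Product.Name,
       Product_Price.Gross_Price, Product_VAT.VAT_Rate
FROM Product
  JOIN Product_Price ON Product_Price.Product_ID = Product.ID
  JOIN Product_VAT ON Product_VAT.Product_ID = Product.ID
WHERE
  Product_Price.Start_Date <= DATE '2020-07-01' AND
  Product_Price.End_Date > DATE '2020-07-01' AND
  Product_VAT.Start_Date <= DATE '2020-07-01' AND
  Product_VAT.End_Date > DATE '2020-07-01';

Run once per target date, this keeps all the “which price is valid when” logic in the pipeline, where it can afford to be slow, rather than in the frontend, where it cannot.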

Now with this table, the query would be significantly shorter:

SELECT ID, Name, Gross_Price, VAT_Rate
FROM WebProduct
WHERE ID = '{whatever}' AND Date = TODAY();

The question that comes up when seeing this schema is “Why on Earth do you have a Date column as part of the primary key, and why do you need to query for today’s date?” I’m not suggesting that the new table is generated to include every single day in existence, but it might be useful to let an ETL pipeline generate more than just one day’s worth of data — you almost always want to generate today’s and tomorrow’s, so that you don’t need to take down your website for maintenance around midnight. But also, if you don’t expect prices to fluctuate on a daily basis, it would be more resource-friendly to run the pipeline every few days instead of daily. It’s a compromise of course, but that’s what system design is for.

Note that in all of this I have ignored the issue of stock. That’s a harder problem, and one that might not actually be suited to be solved with a simple database schema — you need to come to terms with compromises around availability and the fact that you need a single source of truth for how many items you’re allowed to sell… consistency is hard.

Closing my personal rant on database design, there’s another problem I want to shine a spotlight on. When I started working on Autotools Mythbuster, I explicitly wanted to be able to update the content quickly. I have had multiple revisions of the book on the Kindle Store and Kobo, but even those lagged behind the website a few times. Indeed, I think the only reason why they are not lagging behind right now is that most of the changes on the website in the past year or two have been cosmetic only, and did not apply to the ePub.

Even for a project like that, which uses the same source of truth for the content, there’s a heavy difference in the time cost of updating the website versus the “book”. When talking about real books, that’s an even bigger cost — and that’s without going into the realm of print books. Producing content is hard, which is why I realised many years ago that I wouldn’t have the ability to carve out enough time to be a good author.

Even adding diagrams to this blog post has a slightly higher cost than just me ranting “on paper”. And that’s why sometimes I could add more diagrams to illustrate my ideas, but I don’t, because the cost of producing them and keeping them current would be too high. The Glucometers Protocols site has a few rough diagrams, but they are generated with blockdiag so that they can be edited quickly.

When it comes to online tutorials, though, there’s an even bigger problem: possibly the vast majority of them are nowadays on YouTube, shot with a person in frame, more like a teacher in a classroom who can explain things. If something in the video is only minimally incorrect, it’s unlikely that those videos would be re-shot — it would be an immense cost in time. Also, you can’t just update a YouTube video like you do a Kindle book — you lose comments, likes, and view counts, and those things matter for monetization, which is what most of those tutorials out there are made for. So unless the mistakes in a video tutorial are Earth-shattering, it’s hard to expect the creators to go and fix them.

Which is why I think that it’s incredibly important to get the small things right — stop using first and last name fields in databases, objects, forms, and whatever else you are teaching people to make! Think a bit harder about what a product inventory database would look like! Be explicit in pointing out that you’re simplifying to an extreme, rather than providing a real-world-capable database design! And maybe, just maybe, start using examples that are ridiculous enough that they don’t risk being used by a junior developer in the real world.

And let me be clear on this: you can’t blame junior developers for making mistakes such as using a naïve database schema, if that’s all they are taught! I have been saying this at my previous dayjob for a while: you can’t complain about the quality of newbies’ code unless you have provided them with the right information in the documentation — which is why I spent more time than average on example code and tutorials, to fix up the trimmings and make it easier to copy-paste the example code into a working change that follows best practices. In the words of a colleague wiser than me: «Example code should be exemplar.»

So save yourself some trouble in the future by making sure the people that you’re training get the best experience, and can build your next tool to the best of specs.

Publishing Documentation

I have been repeating for years that blogs are not documentation in and of themselves. While I have spent a lot of time over the years making sure that my blog’s links are not broken, I also know that many of my old blog posts are no longer relevant at all. The links out of the blog can be broken, and it’s not particularly easy to identify them. What might have been true in 2009 might not be true in 2020. The best option for implementing something has likely changed significantly, given how ten years ago Cloud Computing was barely a thing on the horizon, and LXC was considered an experiment.

This is the reason why Autotools Mythbuster is the way it is: it’s a “living book” — I can update and improve it, but at the same time it can be used as a stable reference of best practices: when they change it gets updated, but the link is still a pointer to the good practice.

At work, I got used to “Radically Simple Documentation” – thanks to Riona and her team – which pretty much means I only needed to care about the content of the documentation, rather than dealing with how it would render, either in terms of pipeline or style.

And just like other problems with the bubble, when I try to do the same outside of it, I get thoroughly lost. The Glucometer Protocols site has been hosted as GitHub Pages for a few years now — but I now wanted to add some diagrams, as more modern protocols (as well as some older, but messier, protocols) would be much simpler to explain with UML sequence diagrams to go with them.

The first problem was of course to find a way to generate sequence diagrams out of code that can be checked in and reviewed, rather than as binary blobs — and thankfully there are a few options. I settled for blockdiag because it’s the easiest to set up in a hurry. But it turned out that integrating it is far from as easy as it would seem.

While GitHub Pages uses Jekyll, it uses such an old version that reproducing it on Netlify is pretty much impossible. Most of the themes that are available out there are dedicated to personal sites, or ecommerce, or blogs — and even when I found one that seemed suitable for this kind of reference, I couldn’t figure out how to get the whole thing to work. And it didn’t help that Jekyll appears to be very scant on debug logging.

I tried a number of different static site generators, including a few in JavaScript (which I find particularly annoying), but the end result was almost always that they seemed more geared towards “marketing” sites (in a very loose sense) than references. To this moment, I miss the simplicity of g3doc.

I ended up settling for Foliant, which appears to be more geared towards writing actual books than reference documentation, but it wraps around MkDocs, and it provides a plugin that integrates with Blockdiag (although I still have a pending pull request to support more diagram types). And with a bit of playing around, I managed to get Netlify to build this properly and serve it. Which is what you get now.

But of course, since MkDocs (and a number of other Python-based tools I found) appear to rely on the same Markdown library, they are not even completely compatible with the Markdown as written for Jekyll and GitHub Pages: the Python implementation is much stricter when it comes to indentation, and misses some of the features. Most of those appear to have been works in progress at some point, but there doesn’t seem to be much movement on the library itself.

Again, these are relatively simple features I came to expect for documentation. And I know that some of my (soon-to-be-former) colleagues have been working on improving the state of opensource documentation frameworks, including Lisa working on Docsy, which looks awesome — but it relies on Hugo, which I still dislike, and which seems to have taken a direction that is going further and further away from me (the latest, when I was trying to set this up, is that to use Hugo on Linux they now seem to require you to install Homebrew, because clearly having something easy for Linux packagers to work with is not worth it, sigh).

I might reconsider that if Hugo finds a way to build images out of other tools, but I don’t have strong expectations that the needs of reference documentation will be considered in future updates to Hugo, given how it was previously socialized as a static blog engine, only to pivot to needs that would make it more “marketable”.

I even miss GuideXML, to a point. This was Gentoo’s documentation format back in the days before the Wiki. It was complex, and probably more complicated than it should have been, but at least the pipeline to generate the documentation was well defined.

Anyhow, if anyone out there has experience in setting up reference documentation sites, and wants to make it easier to maintain a repository of information on glucometers, I’ll welcome help, suggestions, pull requests, and links to documentation and tools.

Are tutorials to blame for basic IT problems?

It’s now effectively impossible to spend a month following IT (and not just IT) news and not hear of breaches, “hacks”, or general security fiascos. Some of these are tracked down to very basic mistakes in the configuration or coding of software, including the lack of hashing of passwords in databases. Everyone in the industry, including me, has at some point expressed the importance of proper QA and testing, and of budgeting for them in the development process. But what if the problem is much higher up the chain?

Falsehoods Programmers Believe About Names is now over seven years old, and yet my just barely complicated full name (first name with a space in it, surname with an accent) can’t be easily used by most of the services I routinely use. Ireland was particularly interesting, as most services would support characters in the “Latin extended” alphabet, due to the Irish language’s use of ó, but they wouldn’t accept my surname, which uses ò — and this is not a first: I had trouble getting invoices from French companies before, because they only know about ó as a character.

On a related, but not directly connected, topic, there are the problems an acquaintance of mine keeps stumbling across. They don’t want service providers to attach a title to their account, but it looks like most of the developers who implement account handling don’t actually think about this option at all, and make it hard not to set an honorific at all. In particular, it appears that not only do UIs tend to include a mandatory drop-down list of titles, but the database schema (or whichever other model is used to store the information) also stores the title as an enumeration within a list — which is apparent from the way my acquaintance has had their account reverted to a “default” value, likely the “zeroth” one in the enumeration.

And since most systems don’t end up using British Airways’s honorific list but are rather limited to the “usual” ones, that default appears to be “Ms” more often than not, as it sorts (lexicographically) before the others. I have had that happen to me a couple of times too, as I don’t usually fill in the “title” field on paper forms (I never saw much of a point in it), and I guess somewhere in the pipeline a model really expects a person to have a title.

All of this has me wondering, oh-so-many times, why most systems appear to want to store a name in separate entries for first and last name (or variations thereof), and why they insist on having an honorific title that is “one of the list” rather than a freeform field (which would accept the empty string as a valid value). My theory is that it’s the fault of the training, or of the documentation. Multiple tutorials I have read, and even followed, over the years defined a model for a “person” – whether it is a user, customer, or any other entity related to the service itself – and many of these use the most basic identifying information about a person as fields to show how the model works, which gives you “name”, “surname”, and “title” fields. Bonus points for using an enumeration for the title rather than a freeform field, or for validating that the title is one of the “admissible” ones.
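To make the contrast concrete, here is a sketch — my own, in a generic SQL dialect, not taken from any specific tutorial — of the pattern I keep running into, next to what I would rather see taught:

-- The tutorial pattern: mandatory, enumerated title, split name.
CREATE TABLE Person_Tutorial (
  ID         INTEGER PRIMARY KEY,
  Title      TEXT NOT NULL CHECK (Title IN ('Mr', 'Mrs', 'Ms', 'Dr')),
  First_Name TEXT NOT NULL,
  Last_Name  TEXT NOT NULL
);

-- What I would rather see: one free-form name, title optional and free-form.
CREATE TABLE Person (
  ID        INTEGER PRIMARY KEY,
  Full_Name TEXT NOT NULL,
  Title     TEXT              -- NULL (or the empty string) is a perfectly valid value
);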

You could call this a straw man argument, but the truth is that it didn’t take me any time at all to find an example tutorial (See also Archive.is, as I hope the live version can be fixed!) that did exactly that.

Similarly, I have seen sample tutorial code explaining how to write authentication primitives that oversimplifies the procedure by either ignoring salting and hashing altogether or using obviously broken hashing functions such as crypt() rather than anything solid. Given many of us know all too well how even important jobs that are not flashy enough for a “rockstar” can be pushed into the hands of junior developers or even interns, I would not be surprised if a good chunk of the weak authentication problems that are now causing us so much pain were caused by simple bad practices that are (still) taught to those who join our profession.

I am afraid I don’t have an answer for how to fix this situation. While musing, again on Twitter, the only suggestion I got for a good text on writing correct authentication code was the NIST recommendations, but these are, unsurprisingly, written in a tone that is not useful for teaching how to do things. They are standards first and foremost, and they are good, but that makes them extremely unsuitable for newcomers learning how to do things correctly. And while they do provide very solid ground for building formally correct implementations of common authentication libraries, I somehow doubt that most systems care about the formal correctness of their login page, particularly given the stories we have seen up to now.

I have seen comments on social media (different people on different media) pointing out that what makes a good source of documentation changes depending on your expertise, which is quite correct. Giving a long list of things that you should or should not do is probably a bad way to introduce newcomers to development in general. But maybe we should make sure that examples, samples, and documentation are updated so that they show current best practice rather than overly simplified, or artificially complicated (sometimes both at the same time), examples.

If you’re writing documentation, or new libraries (because you’re writing documentation for the new libraries you write, right?), you may want to make sure that the “minimal” example is actually the minimum you need to do, and does not skip over things like error checks or full initialisation. And please, take a look at the various “Falsehoods Programmers Believe About” lists — and see if your example implementation makes those assumptions. If so, fix them, please. You’ll prevent many mistakes from happening in real-world applications, simply because the next junior developer who gets hired to build a startup’s latest website will not be steered towards the wrong implementations.

Reverse Engineering is just the first step

Last year I said that reverse engineering obsolete systems is useful, giving as an example adding Coreboot support for very old motherboards, which are simpler and whose components are more likely to have been described somewhere already. One thing that I realized I didn’t make very clear in that post is that there is an important step in reverse engineering: documenting. As you can imagine from this blog, I think that documenting the reverse engineering process and its results is important, but I found out that this is definitely not the case for everybody.

On the particularly good side, going to 33c3 left a positive impression on me. Talks such as The Ultimate GameBoy Talk were excellent: Michael Steil did an awesome job of describing a lot of the unknown details of Nintendo’s most popular handheld. He also did a great job of showing practical matters, such as which tricks various games used to implement things that at first sight would look impossible. And this is only one of his talks; he has a series going on year after year. I’ve watched his talk about the Commodore 64, and the only reason it’s less enjoyable to watch is that the recording quality suffers from its age.

In other posts I already referenced Micah’s videos. These have also been extremely nice to start watching, as she does a great job at explaining complex concepts, and even the “stream of consciousness” streams are very interesting and a good time to learn new tricks. What attracted me to her content, though, is the following video:

I have been using Wacom tablets for years, and I had no idea how they really worked behind the scenes. Not only does she give a great explanation of the technology in general, but the teardown of the mouse was also awesome, with full schematics and explanations of the small components. No wonder I signed up for her Patreon right away: she deserves to be better known and have a bigger following. And if funding her means spreading more knowledge around, well, then I’m happy to do my bit.

For the free software, open source, and hacking community, reverse engineering is only half the process. The endgame is not for one person to know exactly how something works, but rather for the collectivity to gain more insight into things, so that more people have access to the information and can even improve on it. The community needs not only to help with that but also to prioritise projects that share information. And that does not just mean writing blogs about things. I said this before: blogs don’t replace documentation. You can see blogs as Micah’s shop-streaming videos, while documentation is more like her video about the tablets: it synthesizes the information into an actually usable form, rather than just throwing it around.

I have a similar problem of course: my blog posts are usually a bit of a stream of consciousness, and they do not serve the useful purpose of capturing the factual state of information. Take for example my post about reverse engineering the OneTouch Verio and its rambling on, then compare it with the proper protocol documentation. The latter is the actual important product, compared to my ramblings, and that is the one I can be proud of. I would also argue that documenting these things in an easily consumable form is more important than writing tools implementing them, as those only cover part of the protocol and, in particular, can only leverage my skills, which do not include statistical, pharmaceutical, or data visualisation expertise.

Unfortunately there are obstacles to this idea, of course. Sometimes, reverse engineering documentation is attacked by manufacturers even more than code implementing the same information. So for instance, while I have some information I still haven’t posted about a certain gaming mouse, I already know that the libratbag people do not want documentation of the protocols in their repository or wiki, because it causes them more headaches than the code. And then of course there is the problem of hosting this documentation somewhere.

I have been pushing my documentation to GitHub, hoping nobody causes a stink, but the good thing about using git rather than a wiki or similar tools is exactly that you can just move it around without losing information. This is not always the case: a lot of documentation is still, nowadays, only available either as part of the code itself, or on various people’s homepages. And there are at least two things that can happen with that. The first is the most obvious and morbid one: the author of the documentation dies, and the documentation disappears once their domain registration expires, or whatever else. And if the homepage is hosted at a university or other academic institution, it may very well be that the homepage disappears before the person does anyway.

I know a few other alternatives for storing this kind of data have been suggested, including a common wiki akin to Wikipedia, but allowing for original research; I am still uncertain that would be very helpful. The most obvious thing I can think of is making sure this information can actually be published in books. And I think that No Starch Press, at least, has been doing a lot for this, publishing extremely interesting books including Designing BSD Rootkits and more recently Rootkits and Bootkits, which is still in Early Access. A big kudos to Bill for this.

From my side, I promise I’ll try to organize my findings on anything I work on to the best of my ability, and possibly organize them in a different form than just a blog, because the community deserves better.

Testing stable; stable testing

It might not be obvious, but my tinderbox is testing the “unstable” ~x86 keyword of Gentoo. This choice was originally due to the kind of testing (the latest and greatest versions of packages, whose adverse effects we don’t know yet), and I think it has helped tremendously, up to now, to catch things that could have otherwise bitten people months, if not years, later, especially for the least-used packages. Unfortunately that decision also meant ignoring for the most part the other, now more common architecture (amd64 — although I wonder if the new, experimental amd32 architecture will take its place), as well as ignoring the stable branch of Gentoo, which is luckily tracked by other people from time to time.

Lacking continuous testing, though, what is actually considered stable sometimes isn’t very much so. This problem is further increased by the fact that sometimes the stable requests aren’t proper in themselves: it can be a user asking to stabilise a package, and the arch teams being called in before they should have been; it might be an overlooked issue that the maintainer didn’t think of; or it might simply be that multiple maintainers had different feelings about stabilisation, which happens pretty often in team-maintained packages.

Whatever the case, once a stable request has been filed, it is quite rare that issues are brought up that are severe enough for the stabilisation to be denied: it might be that the current stable is in a much worse shape, or maybe it’s a security stable request, or a required dependency of something following one of those destinies. I find this a bit unfortunate; I’m always happy when issues are brought up that delay a stabilisation, even if that sometimes means having to skip a whole generation of a package, just as long as it means having a nicer stable package.

Testing a package is obviously a pretty taxing task: we have any number of combinations of USE flags, compiler and linker flags, features, and so on and so forth. For some packages, such as PHP, testing each and every combination of USE flags is not only infeasible, but just impossible within this universe’s lifetime. To make the work of the arch team developers more bearable, years ago, starting with the AMD64 team, the concept of Arch Testers was invented: so-called “power users” whose task is to test the package and give the green light for stable-marking.

At first, all of the ATs were as technically capable as a full-fledged developer, to the point that most of the original ones (me included) “graduated” to full devship. With time this requirement seems to have relaxed, probably because AMD64 became usable to the average user, and no longer just to those with enough knowledge to work around the landmines originally present when trying to use a 64-bit system as a desktop — I still remember seeing Firefox fail to build because of integer and pointer variable swaps, once that became an error-worthy mistake in GCC 3.4. Unfortunately this also meant that a number of non-apparent issues became even less apparent.

Most of these issues end up being caused by one simple fault: lack of direction in testing. For most packages, and that includes, unfortunately, a lot of my packages, the ebuild’s selftests are nowhere near comprehensive enough to tell whether the package works or not, and unless the tester actually uses the package, there is little chance that they really know how to test it. Sure, that covers most of the common desktop packages and a huge number of server packages, but it’s far from a perfect solution, since it’s the less common packages that need more eyes on them.

The problem with most other software is that it might require specific hardware, software, or configuration to actually be tested. And most of the time, it requires a few procedures to be applied to ensure that it is actually tested properly. At the same time, I only know of the Java and Emacs teams publishing proper testing procedures for at least a subset of their packages. Most packages could use such documentation, but it’s not something that maintainers, me included, are used to working on. I think that one of the most important tasks here is to remind developers, when asking for stabilisation, to come up with a testing plan. Maybe after asking a few times, we’ll get to work and write up that documentation, once and for all.

PAM, logging in and changing passwords

I’ve been spending the past ten days/two weeks handling two full-time jobs at once; one was Windows-related, so it won’t have any direct effect on what I’ll be posting on the blog; the other involved Amazon EC2, so you’ll be seeing more rants — sorry, I meant posts — on the topic soon. But first, …

Thanks to Constanze, who became a full-fledged developer (congratulations!), I’ve been able to breathe a bit more easily as far as PAM is concerned; another positive note comes from Eray becoming a developer as well, which means I can get someone looking at the pam_krb5 package. Which means I can get back to work on the M4-powered pambase package, so that hopefully before the end of the year we’re going to get it into testing at least. Additionally, user prometheanfire on #gentoo-hardened provided me with a sample configuration for LDAP that should make it much easier to implement it in pambase.

But the situation starts to become much more complicated; for instance, the ConsoleKit situation is so intricate that making it behave as intended is actually quite difficult: the invocation of the module differs depending on whether we’re authenticating a text login or an X11 login session; some time ago we also found out the hard way that some graphical login managers fail badly when you print too much information on the PAM output channel (such as Messages of the Day, the last login data, and mail status). This all results in having to have different sessions for local text and local graphical logins. I already get a huge headache when I start to think about XDMCP.

This turn of events also makes me think that I should simply drop the system-login service that I’ve used in the previous iterations. The reason to use and include this service was to avoid duplication, but with M4, duplication is avoided at build time, not after install. This would leave only the three “leaf” services: system-remote-login (with optional ABL support); system-local-login (not renamed, for compatibility reasons) with text-based login and (by default) the mail/motd/lastlogin modules; and system-graphical-login with support for X11-based ConsoleKit sessions but without the extra verbose modules.

A note here: somebody asked me about the reason for the minimal USE flag on pambase; the reason is relatively simple: even though the output of those modules can easily be discarded, they will be kept loaded in memory by processes such as sshd and fcron; dropping the modules from the services also reduces the memory usage of those processes — minimally, but it does.

After the login process is sorted out, there is another problem here, and it has to do with changing passwords; I’ve said this before, but I’ll repeat it here. When the new pambase is put in place, software that is able to change passwords will have to be updated to use a different service to do so; this will hinder the changing of passwords through sshd that was noted in the comments of one of my previous posts, but it is necessary if we want to have proper restriction among login methods.

The problem is that with PAM’s design, as far as changing passwords is concerned, you end up with one of three options: either you have to know all the currently in-use authentication methods; or you know only one of the authentication methods and then change all the authentication methods to the new value; or you change only one authentication method to the new value.

The end result is that I can’t think of any way to do what would make sense: change the token only for the systems that actually use the current password provided. Lacking that, the situation is that we cannot have a single tool to do everything, so we’re going to have to stick with many different password-changing tools: passwd, chpasswd and their cousins will only require the Unix password and will only change the Unix password. You’re going to use separate tools for Kerberos, LDAP, SSH keys, PKCS#11 tokens, …

While it might sound suboptimal, it’s a compromise that actually makes pambase manageable without having to resort to custom Linux-PAM implementations. I hope you can all agree on that.

Anyway, this only acts as a braindump; I hope I’ll be able to set up real documentation about the pambase system at one point or another, including some simple drawings to show how the authentication flow actually happens. Unfortunately, if you remember, I noted that OpenOffice is the only decent software I could find for writing flowcharts; it is cumbersome to add to a GIT repository, cumbersome to auto-produce results with (when what it exports is what you wanted), and finally quite expensive in terms of dependencies. I should probably give Inkscape another try; possibly tied with rsvg (now that gdk-pixbuf works without X), it would be a decent choice.

Debunking ccache myths redux

Since my original post from two years ago hasn’t yet reached all the users, and some of the developers as well, I would like to reiterate that you should not be enabling ccache unconditionally.

It seems like our own (Gentoo’s) documentation is still reporting that using ccache makes builds “5 to 10 times faster”. I’ll call this statement for what it is: bullshit. The rebuild of the same package might see such a speedup, but not the normal emerge process of a standard Gentoo user. If anything, using ccache will slow your build down, add further failure cases, and make it more difficult to identify errors.

Now, since the approach last time might not have been clear enough, let me try a different one, by describing the steps ccache takes when you call it:

  • it has to parse the command line to make sure you’re calling it for a single compile; it won’t do any good if you’re using it to link, or to build multiple source files at once (you can, especially if you use -fwhole-program, but that’s a topic for another day), so in those cases the command is passed through to the compiler itself;
  • once it knows that it’s doing a single compile, it changes the call to the compiler so that it simply preprocesses the file, and stores the result in a temporary area;
  • now it’s time to hash the data with MD4 (the parent of MD5), which, as the man page suggests, is a strong hash; there are good reasons for it to be strong, but it also means it takes some time to hash the content; we’re not talking about the source files themselves, which are usually very small and thus quick to hash, but rather about the preprocessed file, which includes all the headers used… a quick example on my system: just including eight common header files produces a 120KB output (with -O2 and _FORTIFY_SOURCE… it goes down to 93KB if -O0 is used); to that, add the extra information that ccache has to save (check the man pages for that);
  • now it has to search the filesystem, within its cache directory, for a file with the same MD4 hash; if there is one, it gets either copied (or experimentally hardlinked, but let’s not go there for now); otherwise the preprocessed file is compiled and the result copied into the cache instead; in either case, it involves copying the object file from one side to the other.

Now, we can identify three main time-consuming operations: preprocessing, hashing, and copying; all of them are executed whether the call is a hit or a miss, and if it’s a miss you add the actual build on top. How do they fare in terms of the kind of resources used? Hashing, just like compiling, is a CPU-intensive operation; preprocessing is mixed (you have to read the header files from around the disk); copying is I/O-intensive. Given that nowadays most systems have multiple CPUs and find themselves slowing down on I/O (the tinderbox taught me that the hard way), the copying of files around is going to slow down the build quite a bit. Even more so when the hit-to-miss ratio is high. The tinderbox, when rebuilding the same failing packages over and over again (before I started masking the packages that failed at any given time), had a 40% hit-to-miss ratio and was slowed down by using ccache.

Now, as I already wrote, there is no reason to expect that the same exact code is going to be rebuilt so often on a normal Gentoo system… even if minor updates to the same package were to share most of the source code (without touching the internal header files), for ccache to work you’d have to leave the compiler, the flags, and all the headers of all the dependent libraries untouched… and this often includes the system header files from linux-headers. And even if all these conditions were to hold true, you’d have to have rebuilt, in between, object files for a total size smaller than the cache size, or the objects would have expired. If you think that 2GB is a lot, think again, especially if you use -ggdb.

Okay, now there are some cases where you might care about ccache because you are rebuilding the same package; that includes patch-testing and live ebuilds. In these cases you should not simply set FEATURES=ccache, but you can instead make use of the per-package environment files. You then have two options: you can do what Portage does (setting PATH so that the ccache wrappers are found before the compilers themselves) or you can simply re-set the CC variable, such as export CC="ccache gcc". Just set it in /etc/portage/env/$CATEGORY/$PN and you’re done.

Now it would be nice if our terrific Documentation team – instead of deciding once again (the last time was with respect to alsa-drivers) that they know better than the developers what should be supported – would understand that stating in the handbook that ccache somehow magically makes normal updates “5 to 10 times faster” is foolish and should be avoided. Unfortunately, upon my request, the answer wasn’t what you’d expect from logic.

Autotools Mythbuster: Indexed!

Since there has been talk about Autotools today, and my Autotools Guide got linked at least in the Reddit comments, I decided to take a few minutes of my time and extend the guide a bit further. I was already doing so to document the automake options (I was actually aiming at documenting the flavors, and in particular the foreign mode, so I would stop finding 0-sized NEWS files around), but this time I tried to make it a bit more searchable…

So right now there is a new page with the index of terms, and I shortened the table of concepts so that it flows more easily in the browser. The titles should make it quite clear where you’ll end up. Right now I only added a single index for terms, even though I considered splitting them per macro or variable, similarly to how it’s done in the official documentation; for now this should do. I did add a “common errors” primary term, though, as that should make it easier to find the common errors reported by the various tools that I covered.

Now, that was the good news; here comes the bad news, though. Quite a while after first publication, the guide is still lacking a lot and my style hasn’t particularly improved. I’m not sure how good it can become at this pace. On the other hand, I’m still open to receiving requests and answering them there (thanks to Fabio for asking about it, there’s now a whole section about pkg-config, although it does not cover the -uninstalled variant that I use(d) so much on lscube).

Contributions — corrections, general improvements, or even just ideas — are very welcome; so are donations or, more interestingly nowadays, flattr clicks (thanks to Sebastian for giving me an invite!). There is a flattr button at the bottom of the Autotools Mythbuster pages… if the guide is going to help you, a flattr, little as it may be, will show your appreciation in a way that reminds me why I started working on it.

There is going to be more news related to the guide in the future anyway, and a bit more related to autotools in general — sweet news for some of you, slightly less sweet for me… so stay seated, the journey is still on!

QA by disagreement

A few months ago I criticised the Qt team for not following QA indications regarding the installation of documentation. Now, I wish to apologize and thank them: they were tremendously useful in identifying one area of Gentoo that needs fixes, in both the QA policy and the actual use of it.

The problem: the default QA rules, and the policies encoded in both the Ebuild HOWTO (which should be deprecated!) and Portage itself, are to install a package’s documentation into /usr/share/doc/${PF}, a path that changes between different revisions of the same package. Some packages currently don’t respect that; some because it wasn’t thought about; some because they were mistakenly bound to ${P} on a zero-revision ebuild; some because they need not respect that path.

When I started filing reports for wrongly-installed documentation, I wasn’t expecting any in the latter category; it turns out there are plenty of examples of it, plus a further number of examples and use cases that call for changing that policy altogether:

  • package foo needs to know where package bar installs its documentation, so that it can load it up, for whatever reason; this requires bar to either symlink its documentation somewhere stable or break the current policy, or otherwise you’d have to rebuild foo each time to find the correct path, which is not feasible;
  • package baz needs to know where its own documentation is installed to be able to access it at runtime; this either requires the path to be hardcoded in the sources, or to be written in a configuration file requiring a semi-manual merge through etc-update; this is the case for Postfix, for instance;
  • probably most important for users, API documentation bookmarks currently go stale with every revision unless you use symlinks; this is very annoying for the people who use those packages to develop (and you might guess that my main target here would be Ruby gems).

The solution: I’m not sure if I can say I have a solution, but Samuli and Ulrich proposed a number of possible alternatives to solve the problem; from their suggestions I’d say we have to encode exactly three pieces of information: the category of the package, the package name, and the slot of the package itself — the category is needed because there are a number of packages with the same name and different categories… sometimes even with the same version (the dev-php5 and old dev-php4 categories are a good example of those, and they were systematically breaking the policy stated above).

One solution proposed to me was /usr/share/doc/${CATEGORY}_${PN}-${SLOT}, which wouldn’t be bad… but it would have a -0 appended to most of the directories; my preferred solution would be to do something like omitting -${SLOT} when it’s 0. You’d have stable API documentation links, most of the intra-package and inter-package paths would be stable, and all in all you could drop the need for the documentation-symlinking feature we currently have.

Unfortunately I’m expecting this to either require an EAPI bump, or to take a number of years before it can be properly implemented; I’ll probably have to author – or find someone to author for me – a GLEP to suggest changing dodoc. At the same time we should consider finding a better solution for compression, which is another problem we hit. Right now only the documentation installed with dodoc gets compressed with the chosen compression program, which might be gzip, bzip2 or lzma, same as the man pages. While man pages get processed after install and before the binpkg/livefs merge steps, documentation is not.

But not all documentation needs to be compressed in the first place: HTML files (API documentation first of all), PDF files, and code examples need to be accessible without compression; while we have a dohtml command to install web pages without compressing them, there is no equivalent for the others, and we have to rely on insinto/doins pairs. Furthermore, with more and more autotools-based packages moving to autoconf 2.6x and supporting the --docdir option, we’re going to install more and more documentation directly into that directory, be it with the current ${PF} or another form; as things stand right now, these won’t be compressed.

So, again, thanks to Ben for actually challenging the status quo; his insights here were the spark that made me think about this for a long time.