Finding a better blog workflow

I have been ranting about editors for the past few months, a year after considering shutting the blog down. After some more thinking and fighting, I now have a better plan, and the blog is not going away.

First of all, I decided to switch my editing to Draft and started paying for a subscription at $3.99/month. It’s an as-simple-as-it-can-be editor, with no pretence. It provides the kind of “spaced out” editing that is so trendy nowadays, as well as a so-called “Hemingway” mode that does not allow you to delete. I don’t really care for the latter, but it’s not so bad.

More importantly, it gets the saving right: if the same content is being edited in two different browsers, one gets locked (so I can’t overwrite the content), and a big red message telling me that it can’t save appears the moment I try to edit something after the Internet connection goes away or I get logged out. It has no fancy HTML editor; instead it is designed around Markdown, which is what I’m using nowadays to post on my blog as well. It supports C-i and C-b just fine.

As for the blog engine, I decided not to change it. Yet. But I also decided that upgrading it to Publify is not an option. Among other things, as I went digging to fix a few of the problems I’ve been having, I discovered just how much spaghetti code it was to begin with, and I lost any trust in the developers. Continuing to build upon Typo without taking the time to rewrite it from scratch is, in my opinion, time wasted. Upstream’s direction has been building more and more features to support Heroku, CDNs, and so on and so forth; my target is to make it slimmer, so I started deleting good chunks of code.

The results have been positive, and after some database cleanup and removing support for structures that were never implemented to begin with (like primary and hierarchical categories), browsing the blog should be much faster and less of a pain. Among the features I dropped altogether is theming, as the code is now very specific to my setup, and that allowed me to use the Rails asset pipeline to compile the stylesheets and JavaScript; this should lead to faster load times for everybody (even though it also caused a global cache invalidation, sorry about that!).

My current plan is to not spend too much time on the blog engine in the next few weeks, as it has reached a point where it’s stable enough, but rather to fix a few things in the UI itself, such as the loading of the Amazon ads, which currently causes parts of the page to jump around a little too much. I also need to find a new, better way to deal with image lightboxes: I don’t have many in use, but right now they are implemented with a mixture of Typo magic and JavaScript. Ideally I’d like the JavaScript to take care of everything, attaching itself to data-fullsize-url attributes or something like that, but I have not looked into replacements explicitly yet; suggestions welcome. Similarly, if anybody knows a good JavaScript syntax highlighter to replace coderay, I’m all ears.

Ideally, I’ll be able to move to Rails 4 (and thus Passenger 4) pretty soon, although I’m not sure how well that works with PostgreSQL. Manually adding some indexes to the tables, and especially making sure that the diamond-tables for tags and categories did not include NULL entries and had a proper primary key covering the full row, made quite a difference in the development environment (less so in production, as more data is cached there, but it should still help if you’re jumping around my old blog posts!).
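
For the record, the kind of change I mean is nothing fancier than this sort of migration (a rough sketch; the table and column names are from memory and may not match Typo’s actual schema):

class TightenTagJoinTable < ActiveRecord::Migration
  def up
    # Drop the junk rows first, then make sure they cannot come back and
    # that lookups in both directions are backed by indexes.
    execute "DELETE FROM articles_tags WHERE article_id IS NULL OR tag_id IS NULL"
    execute "ALTER TABLE articles_tags ADD PRIMARY KEY (article_id, tag_id)"
    execute "CREATE INDEX index_articles_tags_on_tag_id ON articles_tags (tag_id)"
  end
end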

Coincidentally, among the features I dropped from the codebase are the update checks and the inbound links (which used the Google Blog Search service that does not exist any more), making the webapp network-free. Akismet stopped working some time ago, and that is one of the things I actually want to re-introduce, but then again I need to make sure that the connection can be filtered correctly.

By the way, for those who are curious why I spend so much time on this blog: I have been able to preserve all the content I could, from my first post on Planet Gentoo in April 2005, on b2evolution. Just a few months short of ten years now. I was also able to recover some posts from my previous KDEDevelopers blog from February of that year, and a few (older) posts in Italian that I originally sent to the Venice Free Software User Group in 2004. Which essentially means, for me, over ten years of memories and words. It is dear to me, and most of you won’t have any idea how much; it probably also says something about priorities in my life, but who cares.

I’m only bothered that I can’t remember where I put the backup I made of my Blogspot blog from when I was in high school. Sure, it’s not exactly the most pleasant writing (and it was all in Italian), but I really would like for it to be part of this single archive. Oh, and this is also the reason why you won’t see me write more on G+ or Facebook: those two and Twitter are essentially just rant platforms to me, but this blog is part of my life.

What I’d like from my blog

My blog is, at this point, a vital part of my routine. I use it to write about my personal projects, about the non-restricted parts of my jobs, and about the work that goes into Gentoo Linux and other projects I follow.

I have accumulated over 2100 posts over time, especially thanks to the recent import of my original blog on Gentoo infrastructure. I don’t really know if that’s a lot, but sometimes Typo seems to struggle with it. Unfortunately I’m also running an older version of Typo, because I haven’t switched that virtual server to Ruby 1.9 yet, as one of my customers is running a version of Radiant that is not going to work otherwise.

Said customer also bitched hard and screamed not to keep the site on my server; but as it happens, the new webmasters who are supposed to pick up the website, and who were supposed to be cheaper and faster than me… have been working since June and have still delivered nothing. Hopefully they’ll be done soon and I can kick said customer off the server.

Anyway, at this point there are a few things that I’d like to get out of my blogging platform in the future, which might require me to fork Typo and create my own version, likely a stripped-down one: there are many things added here that I really don’t care about, like the short URLs, which I might just export (I think I used them at some point) and then handle through mod_rewrite rather than on the Rails side.

So let’s see what I don’t like about the current Typo I’m using:

  • The database access is more than a bit messed up; it probably has to do with the fact that upstream only cares about MySQL, while I want to run it on PostgreSQL, and this causes more than a couple of problems. Have you noticed that sometimes my posts end up password-protected? Well, what happens is that the per-post settings are serialized to YAML and de-serialized on read, but sometimes something bad happens, the YAML becomes invalid, and the password protection kicks in. I know there is an ActiveRecord extension that allows key-value pairs to be stored in PostgreSQL-specific column types instead of having to (de)serialize them all the time, but again, this wouldn’t be something upstream would use (a sketch of what I mean follows right after this list).
  • Alternatively, I’ve been toying with the idea of using MongoDB as a backend. Even with the issues that I have pointed out before, I think it might work well for a blog, especially since the comments would then be tied to the post itself, rather than living in the current set of connected tables.
  • There is a problem with the tags handling, again something upstream doesn’t seem to care about; at some point I remember reading they were mostly interested in making every single word in a post a tag to cross-connect posts with the same word, and it’s one of the reasons why I’m not sure I want to update. If I change the title of one of the tags to make it more descriptive and then edit a post that has that tag, it creates one more tag for each word in that title instead of preserving the older tag. I really should clean up the tags I have right now.
  • I would also like the “new post” page to create the post right away and then drop me into editing it. This is important to me because sometimes, if I have to restart Chromium or suspend the laptop, something goes very wrong and multiple drafts get created for the same post. And cleaning them up is a long task.
  • A better implementation of notifications for new posts, and integration with Flattr, would also be very good. While IFTTT makes it easy to post new entries to Twitter and LinkedIn, its lack of Flattr integration is a major pain, and the fact that right now, to use auto-submit, I have to duplicate part of the content in the HTML of the pages is also a problem. So being able to create a “Flattr thing” the moment I actually post something would be a major plus for me.
  • Since I’m actually quite paranoid, another thing I would like to have would be either two-factor authentication with Google Authenticator on a cellphone or (actually, in addition to it) certificate-based authentication for the admin interface. Having a safe way to make sure that I’m the only one logging in would let me remove some of the administrative-interface rules on ModSecurity, which would in turn let me write posts from public WiFi networks, sidestepping the problem I posted about the other day.
  • Scheduled posting. This used to be supported, but it’s been completely broken for years at this point. It was very useful to me a long time ago, since I would just write a bunch of posts and schedule them to be posted once a day. I suppose this should now be changed so that scheduled posts only become visible once their time has elapsed, with a process that actually flips them to published (see the second sketch below). But again, this is something that I’d like to have, and you readers would probably enjoy it too, as it would probably make for more and better content overall.
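
To illustrate the first point about the serialized settings, here is a rough sketch of the difference between the current YAML round-trip and a PostgreSQL-native hstore column; the model and column names are made up for the example and are not Typo’s actual schema:

# Today: every settings read/write goes through YAML (de)serialization,
# and a corrupted blob is enough to flip on password protection.
class Content < ActiveRecord::Base
  serialize :settings, Hash
end

# Sketch of the alternative: keep the settings in an hstore column so the
# database understands the keys directly, no YAML involved.
class MoveSettingsToHstore < ActiveRecord::Migration
  def up
    execute "CREATE EXTENSION IF NOT EXISTS hstore"
    execute "ALTER TABLE contents ADD COLUMN settings_kv hstore NOT NULL DEFAULT ''"
    execute "CREATE INDEX index_contents_on_settings_kv ON contents USING gin (settings_kv)"
  end
end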
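
And for the scheduled-posting item, a minimal sketch of the “publish when the time has elapsed” approach (again with hypothetical model and column names, not Typo’s actual code):

# Public queries only ever see posts whose publication time has passed.
class Article < ActiveRecord::Base
  scope :visible, lambda { where("published = ? AND published_at <= ?", true, Time.now) }
end

# A small task run from cron flips the due drafts to published; this is also
# where the notification and Flattr hooks would naturally live.
namespace :blog do
  desc "Publish scheduled posts whose time has elapsed"
  task :publish_due => :environment do
    Article.where("published = ? AND published_at <= ?", false, Time.now).find_each do |post|
      post.update_attributes!(:published => true)
      # notify Twitter/LinkedIn, create the Flattr thing, etc.
    end
  end
end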

I definitely do not want to go with WordPress. I just wish I had the time to write my own Typo fork and make it more usable for what I do, rather than hoping that the upstream development of Typo does not go in a direction I don’t like at all. Maybe somebody else has the same requirements and would like to join me in this project; if so, send me an email. Maybe it’ll finally be the time I decide to start on the fork itself.

The pain of installing RT in Gentoo

Since some of my customers tend to forget what they asked me to do and then complain that I’m going over budget or over time, I’ve finally decided to bite the bullet and set up some kind of tracker. Of course, given that almost all of said customers are not technical at all, using Bugzilla was completely out of the question. The choice fell on RT, which has a nice email-based interface for them to use.

Setting this up seemed a simple process, after all: you just have to emerge the package, deal with the obnoxious webapp-config (you can tell I don’t like it at all!), and set up the database and mod_perl for Apache. It turned out not to be so easy. The first problem is the one I already wrote about, at least in passing: Apache went segfaulting on me when loading mod_perl on this server, and I didn’t care enough to actually go and debug why.

But fear not: since, as I said, I’ve already rented a second server, I decided to try deploying RT there. It shouldn’t be any trouble, no? Hah, I wish.

The first problem is that Apache refused to start because the webmux.pl script couldn’t be launched. Which was at the very least bothersome, since it also refused to show me any error message besides repeating that it couldn’t load it. I decided against trying mod_perl once again and moved to a more “common” configuration of lighttpd and reverse proxying, using FastCGI.

And here the trouble starts getting even nastier. To begin with, FastCGI requires you to start RT with its own init script; the one provided by the current 3.8.10 ebuild is pretty much outdated and won’t work, possibly at all. I rewrote it (and I’ll see to pushing the rewrite into Portage soon), and got it to at least try starting up. But even then it won’t start. Why is that?

It has to do with the way I decided to set up the database: since the new server will at some point run a series of WordPress instances (don’t ask!), it’ll have to run MySQL, but there will be other web apps that should use PostgreSQL, and as long as performance is not that big an issue, I wanted to keep one database software per server. This meant connecting to PostgreSQL running on Earhart (which is on the same network anyway), and to do so, besides limiting access through iptables, I set it to use SSL. That was a mistake.

Even though you may set authentication to trust in the pg_hba.conf configuration file, the client-side PostgreSQL library tries to see if there are authentication tokens to use, which in the case of SSL can be of two kinds: passwords and certificates. The former is the usual clear-text password; the latter, as the name implies, is an SSL user certificate that can be used to validate the secure connection from one end to the other. I had no interest in using user certificates at that point, so I didn’t care much about procuring or producing any.

So when I start the rt service (without using --background, that is… I’ll solve that before committing the new init script), I get this:

 * Starting RT ...
DBI connect('dbname=rt3;host=earhart.flameeyes.eu;requiressl=1','rt',...) failed: could not open certificate file "/dev/null/.postgresql/postgresql.crt": Not a directory at /usr/lib64/perl5/vendor_perl/5.12.4/DBIx/SearchBuilder/Handle.pm line 106
Connect Failed could not open certificate file "/dev/null/.postgresql/postgresql.crt": Not a directory
 at //var/www/clienti.flameeyes.eu/rt-3.8.10/lib/RT.pm line 206
Compilation failed in require at /var/www/clienti.flameeyes.eu/rt-3.8.10/bin/mason_handler.fcgi line 54.
 * start-stop-daemon: failed to start `/var/www/clienti.flameeyes.eu/rt-3.8.10/bin/mason_handler.fcgi'                                                                       [ !! ]
 * ERROR: rt failed to start

Obviously /dev/null is the home of the rt user, which is what I’m trying to run this as, and of course it is not a directory, so trying to handle it as one will make the calls fail exactly as expected. And if you see this, your first thought is likely to be “PostgreSQL does not support connecting via SSL without a user certificate, what a trouble!”… and you’d be wrong.

Indeed, if you look at a strace of psql run as root (again, don’t ask), you’ll see this:

stat("/root/.pgpass", 0x74cde2a44210)   = -1 ENOENT (No such file or directory)
stat("/root/.postgresql/postgresql.crt", 0x74cde2a41bb0) = -1 ENOENT (No such file or directory)
stat("/root/.postgresql/root.crt", 0x74cde2a41bb0) = -1 ENOENT (No such file or directory)

So it tries to find the certificate, doesn’t find it, and proceeds to look for a different one; if even that doesn’t exist, it gives up. But that’s not what happens in the case above. The reason is probably a silly one: the library checks that errno is ENOENT before ignoring the error, while any other error (such as the “Not a directory” ENOTDIR above) is likely considered fatal.
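
The difference is easy to verify from any language; here is a quick sketch in Ruby, just to illustrate the two errno values involved (not libpq’s actual code, obviously):

# A home directory that exists but is not a directory trips ENOTDIR...
begin
  File.stat("/dev/null/.postgresql/postgresql.crt")
rescue Errno::ENOTDIR => e
  puts "apparently fatal: #{e.message}"
end

# ...while a missing file under a missing (or real) directory gives ENOENT,
# which seems to be the only error the library silently ignores.
begin
  File.stat("/nonexistent/.postgresql/postgresql.crt")
rescue Errno::ENOENT => e
  puts "silently ignored: #{e.message}"
end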

So how do you deal with an issue like this? The obvious answer would be to make the home directory point to the RT installation directory, so that it’s also writeable by the user; in most cases this only requires you to set the $HOME variable, but that’s not the case for PostgreSQL, which instead decides to be smarter than that and looks up the user’s home directory from the passwd file…

So why not change the user’s home directory to that directory, then? One reason is that you could have multiple RT instances on the same system, mostly thanks to webapp-config; another is that even with a single RT instance, the path to the installed code has the package’s version in it, so you would have to change the user’s home at each version bump, which is not something you want to be doing.

How to solve this mess, then? Well, there is one “solution”, which is what I’m going to do: set up RT on the same system as PostgreSQL, either with lighttpd or by using FastCGI directly within Apache, I have yet to decide. Then there is the actual fix: get the PostgreSQL client library to respect $HOME and, at the same time, make it not throw a fit if the home directory is not really a directory. I just don’t think I have the time to dedicate to the real fix for now.

Ruby-Elf and collision detection improvements

While the main use of Ruby-Elf for me lately has been quite different – for instance with the advent of elfgrep or helping verify LFS support – the original reason that brought me to write that parser was finding symbol collisions (that’s almost four years ago… wow!).

And symbol collisions are indeed still a problem, and as I wrote recently they are not easy for upstream developers to reason about, as they are mostly an indication of possible aleatory problems in the future.

At any rate, the original script ran overnight, generated a huge database, and then required even more time to produce readable output, all of which happened using an unbearable amount of RAM. Between the ability to run it on a much more powerful box and the work done to refine it, it can currently scan Yamato’s host system in… 12 minutes.

The latest set of changes, which replaced the “one or two hours” execution time with the current “about ten minutes” (for the harvesting part; the analysis takes two more minutes), was part of my big rewrite of the script so that it uses the same common class interfaces as the commands that are installed for use with the gem. With this setup, albeit still single-threaded (more on that in a moment), each file analysed results in three calls to the PostgreSQL backend, rather than something in the ballpark of five plus one per symbol, and this makes it quite a bit faster.

To achieve this I first of all limited the round-trips between Ruby and PostgreSQL when deciding whether a file (or a symbol) has already been added or not. In the previous iteration I was already optimising this a bit by using prepared statements (which seemed slightly faster than direct queries), but they didn’t allow me to embed the logic into them, so I had a number of select and insert statements depending on the results of earlier ones. That was bad not only because each selection required converting data types twice (from the PostgreSQL representation to C, then from that to Ruby), but also because it required calling into the database each time.

So I decided to bite the bullet and, even though I know it turns into a bunch of spaghetti code, I’ve moved part of the logic into PostgreSQL through stored procedures. Long live PL/pgSQL.
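
To give an idea of the shape of it, here is a much simplified sketch (the function and table names are invented for the example, not the actual schema): the “look up or insert” dance happens server-side, so Ruby only issues one call per file.

require 'pg'

db = PG.connect(:dbname => 'symbols')

# Server-side find-or-create for an object file: one round-trip instead of
# a SELECT followed by a conditional INSERT issued from Ruby.
db.exec <<SQL
CREATE OR REPLACE FUNCTION ensure_object(p_path text) RETURNS integer AS $$
DECLARE
  v_id integer;
BEGIN
  SELECT id INTO v_id FROM objects WHERE path = p_path;
  IF v_id IS NULL THEN
    INSERT INTO objects(path) VALUES (p_path) RETURNING id INTO v_id;
  END IF;
  RETURN v_id;
END;
$$ LANGUAGE plpgsql;
SQL

object_id = db.exec_params("SELECT ensure_object($1)", ["/usr/lib64/libfoo.so.1"]).getvalue(0, 0)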

Also, to make it more robust with respect to parsing errors on single object files, rather than queuing all the queries and then committing them in one big transaction, I create a transaction per object to commit all of its symbols, and another when creating the indexes. This allows me to skip over objects altogether if they are broken, without stopping the whole harvesting process.

Even after introducing the per-object transaction for symbol harvesting, I found it much faster to run a single statement through PostgreSQL within the transaction, with all the symbols; since I cannot simply run a single INSERT INTO with multiple values (because I might hit a unique constraint when the symbols are part of a “multiple implementations” object), I at least call the same stored procedure multiple times within the same statement. This had a tremendous effect, even though the database is accessed through Unix sockets!
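
Roughly, the batching looks like this (again with hypothetical names): all of an object’s symbols go through the stored procedure in one multi-statement string, wrapped in the object’s own transaction so that a broken object only rolls back itself.

# One network round-trip commits all the symbols of a single object.
def commit_symbols(db, object_id, symbols)
  calls = symbols.map do |sym|
    "SELECT ensure_symbol(#{object_id.to_i}, '#{db.escape_string(sym)}');"
  end

  db.exec("BEGIN")
  db.exec(calls.join("\n"))
  db.exec("COMMIT")
rescue PG::Error
  db.exec("ROLLBACK")   # a broken object only loses its own symbols
end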

Since the harvest process now takes so little time to complete, compared to what it did before, I also dropped the split between harvest and analysis: analyse.rb is gone, merged into the harvest.rb script, which I still have to write a man page for, sooner or later, and get properly installed as an available tool rather than an external one.

Now, as I said before, this script is still single-threaded; on the other hand, all the other tools are “properly multithreaded”, in the sense that their code fires up a new Ruby thread for each file to analyse, and the results are synchronised so as not to step on each other’s feet. You might already know that, at least as far as Ruby 1.8 is concerned, threading is not really implemented and green threads are used instead, which means there is no real advantage in using them; that’s definitely true. On the other hand, on Ruby 1.9, even though the pure-Ruby nature of Ruby-Elf makes the GIL a main obstacle, threading would improve the situation simply by allowing threads to analyse more files while the pg backend gem sends the data over to PostgreSQL (which would probably also be helped by the “big” transactions being sent right now). But what about the other tools that don’t use external extensions at all?

Well, threading elfgrep or cowstats gives no real advantage on the “usual” Ruby versions (MRI 1.8 and 1.9), but it provides a huge advantage when running them with JRuby: as that implementation has real threads, it can scan multiple files at once (both when listing input files asynchronously through the standard input stream and when providing all of them in one single sweep), and then only synchronise to output the results. This of course makes it a bit trickier to be sure that everything is being executed properly, but in general it makes the tools all the sweeter. Too bad that I can’t use JRuby right now for harvest.rb, as the pg gem I’m using is not available for JRuby; I’d have to rewrite the code to use JDBC instead.
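
The per-file threading of the other tools boils down, in spirit, to something like this stripped-down sketch (the analyse call is a placeholder for whatever work each tool actually does):

require 'thread'

output_mutex = Mutex.new

threads = ARGV.map do |path|
  Thread.new(path) do |file|
    result = analyse(file)           # placeholder for the per-file work
    output_mutex.synchronize do      # only the output is serialised
      puts "#{file}: #{result}"
    end
  end
end

threads.each(&:join)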

Speaking of options passing, I’ve been removing some features I originally implemented. In the original implementation, the arguments parsing was asynchronous and incremental, with no limit on recursion; this meant that you could provide a list of files, each preceded by the at-symbol, as the standard input of the process, and each of those would be scanned for… the same content. This could already have been bad because of possible loops, but it also had a few more problems, among which was the lack of a way to add a predefined list of targets if none was passed (which I needed for harvest.rb to behave more or less like before). I’ve since rewritten the targets’ parsing code to only do a single-depth expansion, and to rely on asynchronous argument passing only through the standard input, which is only used when no arguments are given, either on the command line or by the script’s defaults. It’s also much faster this way.
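
The single-depth expansion amounts to something like this sketch (not the actual code in the gem): an @file argument is read for targets exactly once, and whatever it contains is taken literally, with no further @-recursion.

# Expand "@listfile" arguments one level deep; fall back to the script's
# default targets, and only then to standard input, when nothing is given.
def expand_targets(args, defaults = [])
  args = defaults if args.empty?
  args = $stdin.read.split if args.empty?

  args.map do |arg|
    if arg.start_with?("@")
      File.readlines(arg[1..-1]).map { |line| line.strip }.reject { |line| line.empty? }
    else
      arg
    end
  end.flatten
end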

For today I guess all these notes about Ruby-Elf are enough; in the next few days I hope to provide some more details about the information the script is giving me. It isn’t exactly fun, and it isn’t exactly the kind of thing you want to know about your system. But I guess that is a story for another day.

Sysadmining tips from a non-sysadmin

I definitely am not a sysadmin: although as a Gentoo developer I have to have some general knowledge of sysadmining, my main work is development, and that’s where most of my experience lies. On the other hand, I picked up some skills by maintaining two VPSes (the one where this blog is hosted, and the one hosting xine’s Bugzilla as well as its site).

Some of these tricks are related to the difficulties I have reported with using Gentoo as a guest operating system in virtual servers, but a few are not related at all. Let me try to relay some of the tricks I picked up.

The first trick is to use metalog for logging; while syslog-ng has some extra features missing in metalog (like network logging support), for a single server the latter is much, much easier to set up and deal with. But I find the default configuration a bit unwieldy. My first step is then to replace the default catch-all “everything” entry with a “the rest” entry, by doing something along these lines:

Postgresql :
  program_regex = "^postmaster"
  program_regex = "^postgres"
  logdir   = "/var/log/postgres"
  break    = 1

Apache :
  program_regex = "^httpd"
  logdir   = "/var/log/http"
  break    = 1

The rest of important stuff :
  facility = "*"
  minimum  = 6
  logdir   = "/var/log/therest"

See that break statement? The whole point of it is to avoid falling through to the entries below it in the file; at the end, the “the rest” block logs everything that did not match a previous entry. My reason for splitting things this way is that I can easily check the logs for cron or PostgreSQL and, at the same time, check whether there is something I’m not expecting.

While using metalog drops the requirement for logrotate for the system logs, it doesn’t remove the need for it for other logging: Quassel doesn’t log to syslog, nor does Portage, and the Apache access logs are better handled without syslog if they are to be passed through awstats later. Note: having Portage log to syslog is something I might make good use of; it would break qlop, but it might be worth it for some setups, like my two VPSes. But even with this limitation, metalog makes it much easier to deal with the basic logs.

The next step to simplify management for me has been switching from Paul Vixie’s cron to fcron. The main reason is that fcron feels “modern” compared with Vixie’s, and it has a few very useful features that make it much easier to deal with: erroronlymail sends you mail about a cron job only if its exit status is non-zero (failure), rather than every time there is output; random makes it possible to avoid running heavy-handed jobs always at the same time (which also makes the system altogether more secure, as an attacker cannot guess that at a given time the system will be under extra load!); and the lavg options allow you to skip running a series of jobs if the system is busy doing something else.

Oh, and another important note for those of you using PostgreSQL: I learnt the hard way the other day that the default logging behaviour of the PGSQL server is to write a postmaster.log file inside the PostgreSQL data directory. This file does not really need to be rotated, as postgres seems to take care of that itself; on the other hand, it makes much more sense to leave the task to the best-suited software: the logger! To fix this you have to edit the /var/lib/postgresql/8.4/data/postgresql.conf file (you may have to change 8.4 to the version of PGSQL you’re running), and add the following line:

log_destination = 'syslog'

Thanks to metalog’s buffering and all its features, this should go much easier on the I/O of the system, especially if the load is very high, which is sometimes the case when my blog gets hammered.

Okay, probably most of this is nothing new to seasoned sysadmins, but small hints from a non-sysadmin are the kind of thing that other non-sysadmins on the same job can find useful.

Ebuilds have to be done right

There is quite a stir right now on the gentoo-dev mailing list following a mass-masking for removal of packages for QA and security reasons; I think that Alec nailed down most of the issues with his comments:

> This thread is yet another proof that we need to introduce a “Upcoming
> masking” for unmaintained packages.

<sarcasm>

Shall I file those forms in triplicate and fax them to the main office sir?

</sarcasm>

Since amazingly I actually started the Treecleaners project; the
intent was actually to fix problems with packages. Part of the
problem is that there are hundreds of packages in the tree and the
fixes vary in complexity so it is difficult to create hard-and-fast
rules on when to keep a package versus when to toss it. One of the
things I like about masking is that it quickly gets people who
actually care about the package up to bat to fix it instead of leaving
it broken for months. I realize maintainers do not exactly enjoy this
kind of poking, however when things have been left for long enough I
believe our options become a bit more limited (in this case, masking
for removal due to unfixed sec bugs.)

Now, this is one issue I already partly addressed in my post about the five-minute fix myth, but I’d like to point out again that even though we can easily spot some blatant problems with packages, having a package that compiles and passes the obvious, programmatic QA checks does not really tell you much about the health of the package; indeed, you won’t know whether the package works at all for the final users. Tying this to another post of mine (incidentally, someone complained about my self-references to posts… should I stop giving pointers and context?), I have to admit that sometimes it’s impossible to have 100% coverage of packages, among other reasons because some packages need particular hardware, or particular software components set up, to be tested effectively. On the other hand, when such a complex setup isn’t strictly needed, we should expect some level of testing when making changes, minor or otherwise.

Sometimes the mistakes are in the messages logged by the ebuild; at other times, the problem is that some important part of the package is missing, for example because the install phase is manually written in the ebuild, and upstream has added some extra utility that is installed by make install but is obviously ignored by the ebuild (and this actually is one of the points that Donnie brought up when I suggested overriding upstream build systems with an eclass: we’d have to triple-check new releases to make sure that no further source files, objects or libraries were added since the previously-packaged version). All these things are almost impossible to identify in a nice, programmatic, scripted way; they need knowledge of the package, checking the release notes, and having an idea of how to test it.

For instance, I’ve been looking into sys-libs/libnss-pgsql today, as I have an interest in it; the ebuild installs the shared library manually (skipping libtool’s relinking phase, by the way). Why did it do that? It takes four steps rather than the one needed for make install… well, the reason was obvious (but not commented upon!) after changing it to use make install: a post-install check actually aborted the merge. The problem was that the package installed the Name Service Switch library in /lib, but also installed the static archive and the libtool .la file, both of which are definitely not needed in /lib. The handwritten install solution treats the symptoms but not the following problems:

  • it will still build the static archive (non-PIC) version, causing twice the number of compiler calls;
  • it won’t tell upstream that they forgot one thing in their Makefile.am;
  • it’s still wrong because the libraries it links to are not available in /lib: it won’t work before /usr is mounted if /usr is on a different partition (who still does that, nowadays?!); it should be in /usr itself, at this point (and yes, you can do that: both GNU libc and FreeBSD – which has a different NSS interface, by the way – check both /lib and /usr/lib).

Incidentally, why does glibc’s default nsswitch.conf use “db files” for services, protocols, rpc and ethers? Their presence in there means that each time you call into glibc to resolve a port name, it makes eight open() syscalls trying to find the db file. It doesn’t sound quite right.

I have patches, and I have a new ebuild; I’ll see to sending them upstream and getting the ebuild committed (by someone else, or by picking up maintainership of it) in the next day or so. In the meantime I have to get back to my work.

Testing environments

I don’t feel too well; I guess it’s the anger caused by the whole situation, coupled with lots of work to do (including accounting, for the first time in my case, as it’s that time of the year), and a personal emotional situation that went definitely haywire. I’m trying to write this while working on some other things, and eating, and so on and so forth, so it might not be too coherent in itself.

In yesterday’s post I pointed to a post by Ryan regarding testsuites, and the lack of consistent handling of testsuites when making changes. While it is true that there are a lot of ways for test failures to go undetected, I think there are some more subtle problems with a few of the testsuites I encountered in the tinderbox project.

One of these problems, which I already noted yesterday, is the lack of a testsuite from upstream. This involves all kinds of projects: end-user utilities, libraries (C, Ruby, Python, Perl), and daemons. For some of those, the problem is not so much that there is no testsuite, but rather that the testsuite doesn’t get released together with the code, for various reasons (most of which end up being that the testsuite outweighs the code itself many times over), and that it’s not as easy to track down where the suite is. For Ruby packages, more than a few times we end up having to download the code from GitHub rather than using the gem, for instance (luckily, this is fairly easy for us to do, but I’ll try not to digress further).

Some tests also depend on specific hardware or software components, and those are probably the ones that give the worst headaches to developers. As far as hardware is concerned, well, it’s tough luck: you either have the hardware or you don’t (there is one more facet, where you might have the hardware but not be able to access it at the time, but let’s not dig into that). The fun starts when you have dependencies on some particular software component. I don’t mean depending on libraries or tools, which are a given and cannot be solved in any other way besides actually adding the dependencies, but rather depending on services and daemons being up and running.

Let’s take for instance the testsuite for dev-ruby/pg, the PostgreSQL bindings extension for Ruby. Since you have to test that the bindings work, you need to be able to access PostgreSQL; obviously you shouldn’t be running this against a production PostgreSQL server, as that might be quite nasty (think of the tests actually accessing or deleting your data). For this reason, the 0.8 series of the package does not have any testsuite (bad!). This was solved in the new 0.9 series, as upstream added support for launching a private, local copy of PostgreSQL to test against. This actually adds another problem, but I’ll go back to that later on.
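
I haven’t checked exactly how pg 0.9 implements it, but the general trick of a private, throwaway PostgreSQL for a testsuite looks more or less like this sketch (paths and port number are made up):

require 'tmpdir'
require 'fileutils'

datadir = Dir.mktmpdir("pg-test")
port    = 54321

# A scratch cluster on a private port, with the socket inside the data
# directory, so the system PostgreSQL is never touched.
system("initdb", "-D", datadir, "-A", "trust") or abort "initdb failed"
system("pg_ctl", "start", "-w", "-D", datadir, "-o", "-p #{port} -k #{datadir}") or abort "pg_ctl failed"

begin
  system("createdb", "-h", datadir, "-p", port.to_s, "pg_test")
  # ... run the bindings' tests against host=datadir, port=54321 here ...
ensure
  system("pg_ctl", "stop", "-D", datadir, "-m", "fast")
  FileUtils.remove_entry_secure(datadir)
end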

But if database-server-related problems are quite obvious (which is why things like ActiveRecord only run tests against SQLite3, which does not need any service running), there are worse situations. For instance, what about all the software communicating through D-Bus? The software assumes it can talk to the system instance of D-Bus to work, but what if you’re going to test disruptive methods and there is a local, working, installed copy of the same software? In general you don’t want the software under test to interact with the software running on the system. On the other hand, there are a number of packages that fail their tests if D-Bus is not running, or, in the case of sbcl, if there is no syslog listening on /dev/log. These will also create quite a stir, as you might guess.

Now, earlier I said that the new support for launching a local instance of PostgreSQL in the pg 0.9 series creates one further problem; that problem is that it adds one limitation on the environment: you have to be able to start PostgreSQL from the testsuite. What’s the problem with that? Well, to be able to run the PostgreSQL commands you need to drop privileges to non-root, so if you run the testsuite as root you’ll fail… and while Portage does allow running tests as non-root, I’m afraid it still defaults to root (FEATURES=userpriv is what controls the behaviour). And even if the default were changed, there are other tests that only work as root, or even some, like libarchive’s, that run slightly different tests depending on which user you run them as. If you run them as root, they’ll ensure, for instance, that the preservation of users and permissions works; if you run them as non-root, that you cannot write as a different user or cannot restore the permissions.

You can probably start to see what the problem is with tests: they are not easy; getting them right is difficult, and more often than not the upstream tests only work in particular environmental conditions that we cannot reproduce properly. And a failure in the testsuite is probably one of the most common showstoppers for a stable request (this matters when the older version worked properly while the new one fails, as regressions in stable have a huge marginal cost!).

What’s wrong with Gentoo, anyway?

Yesterday I snapped and declared my intent to resign from Gentoo, together with stopping the tinderbox and abandoning the use of Gentoo altogether. Why did that happen? Well, it’s a huge mix of problems, all joined together by one common factor: no matter how much work I pour into getting Gentoo working like it should, more problems are generated by sloppy work from at least one or two developers.

I’m not referring to the misunderstandings about QA rules, which happen and are naturally caused by the fact that we’re humans and not beings of pure logic (luckily! how boring would it be otherwise, to always behave in the most logical way!). Those can upset me, but they are, after all, no big deal. What I’m referring to is the situation where one or two developers can screw up the whole tree without anybody being (reasonably) able to do a thing about it. We’ve had two (different) examples of this in the past few months, and while both have undeniably bothered QA, users, and developers alike, no action has been taken in either case.

We thus have developer A, who decided that it’s a good idea to force all users to have Python 3 installed on their systems because upstream released it (even though upstream considers it still experimental, something to toy with), and who kept on ignoring calls from both users and developers to drop that (luckily, the arch teams are not mindless drones, and wouldn’t let this slide to stable as he intended in the first place). The same developer also hasn’t been able to properly address one slight problem with the new wrapper, months after unleashing it on the unstable users (unstable does not mean unusable).

Then we have developer B, who feels like the tree’s saviour, the only person who can make Gentoo bleeding edge again… while most, if not all, of the rest of the developer pool is working on getting Gentoo more stable and more maintainable. So, among the things he went on to do, there was a poorly-performed Samba bump (“suboptimal” was the term he used; I ended up having to fix the init scripts myself because they weren’t stopping/restarting properly, as the ebuild and the init scripts went out of sync regarding paths), some strangely incomplete PostgreSQL changes, and a number of minor problems with other packages.

Of the two, I was at first most upset by the former, but in the long run the latter is the one who drove me mad. Let’s not dig too much into his stance on --as-needed (“cosmetics”; yeah, because being able to return from a jpeg bump with fewer than 100 packages to rebuild, rather than the whole world, is just cosmetics), and the fact that he has ignored most of the QA issues with the packages he touched. Instead look at his behaviour with a package of mine (alas, I made the mistake of letting this one slip with just a warning; I should have taken the chance to actually defer it to devrel…): vbindiff.

The package is something I added a while ago because from time to time it comes in handy. I’m in metadata.xml; I’m definitely not an unresponsive maintainer. Yet, while my last bump was in June 2008, the version in tree was not the latest one up until last September (2009). Why? A quick glance at the homepage shows that the beta4 release mostly fixed a Win32 bug and introduced a way to enable debug mode. So what happens? Our mighty developer decides to go on and bump the package; without asking me; with nobody asking him; without a mail, a nod or anything. I literally noticed this as emerge tried to upgrade a package I know I maintain. You’d expect the debug support to be present in the ebuild then, and indeed you’d find a debug USE flag if you checked now, but that’s something I added myself afterwards, as the damage of pointlessly bumping something was already done.

Now, why did that happen? Well, he admitted he just went through the dev-* categories, without considering the maintainers declared in metadata, and blindly bumped ebuilds when the latest version available on the site was higher than the one in tree. Case in point: he had to open the vbindiff site, so the release notes mentioning Win32 and --enable-debug would have been clearly visible, had he cared to read even part of them. Whoever has tried doing serious ebuild work should know that most of the time even the upstream-provided release notes are not enough to go by… Interestingly enough, his bleeding-edge hunger didn’t make him ask for a new stable, and we currently have a very old one.

So there you have your developer B, the super-hero, the last good hope of the bleeding edge, who bumps packages without consulting the guy who maintains them (and who is around almost 24/7) and without even caring to use them at all. Why did I let it slip? Probably because I was more focused on trying to stop developer A at the time. I did issue a reprimand reminding him not to touch someone else’s packages, and to learn to use package.mask for things like Samba. I was hoping he would listen. Oh boy, was I ever so wrong.

Speaking of Samba again for a second, did I mention yet that the split into multiple packages was done, straight to ~arch, without any follow-up plan to convert dependencies? Wonder why the whole thing is now stalemated again. Maybe the arch teams don’t look too kindly on having the same kind of dependency breakage in stable as there was/is in unstable right now.

First-hand information about our developer B has him aligned with a zealot point of view regarding the Mono project; you’d then guess that dotnet stuff would be the last thing he’d be touching, but instead he went ahead without asking anybody, ignoring the fact that I stated at FOSDEM that I was going to look into that as soon as I had time, that I had stated multiple times before that I was already working on un-splitting the gtk-sharp packages, and that I got in contact with the Mono developers (again at FOSDEM) to try to follow upstream more closely. Oh, and the one thing that pissed me off about that bump? Besides the fact that tomboy now refuses to work? Remember this patch? It was dropped; without even mailing me to ask whether I had, or could make, a version for the latest release. It was dropped in unstable (or, as it should be called if this kind of stuff is allowed to continue, unusable).

And the cherry on top? As I said, this developer touched Samba, PostgreSQL, now Mono… there are three aliases for these things (samba, pgsql-bugs and dotnet), which the bugs get assigned to… and he’s on none of them! And before somebody tries to argue otherwise, I’m pretty confident he’s not following the aliases on Bugzilla (plus, given he also argued that the problem was with leaving security-vulnerable stuff in the tree – which by the way means having working, complete, safe ebuilds to be able to mark stable, and he doesn’t seem to be able to come up with any of those – the most important security bugs don’t get sent to watchers). How does he expect to see the bugs coming? Oh, by wrangling the bugs himself! Yeah, after all developers don’t file bugs themselves and assign them straight to the maintainers by procedure, do they? (fun fact: Bugzilla queries report at most 5K bugs, so that list is a very limited result compared to what I was hoping to get); nor do other developers ever wrangle bugs, that would be silly, and there is no Arch Tester to speak of, right?

You can now see most of the picture, and why I’m mostly upset with developer B. What made me snap yesterday were remarks insisting that I was just “whining” and “not doing enough” as bugs kept piling up. What the heck? I have constantly had over 1000 bugs open (over 1300 today) for the past year or so; I know very well that bugs keep piling up! And I’ve been doing all I can outside of my work hours (while I have to thank some people, including Paul, David, Simon, Andrew and Bela for their contributions, I’m not paid to do Gentoo work; and while I do get to use it, and thus contribute back to it, for some of the jobs I take, it’s definitely not the same as working on Gentoo), including the whole Ruby NG porting and improvement work, trying to make sure we can actually get to a point where unmasking Ruby 1.9 will not break any user whatsoever. Am I really doing too little? “Not enough”?

Okay, so the proper way to handle this, with the current procedures, would be to take it up to Developer Relations so that they could act on it; QA can only ask infra to restrict commit access if we’re expecting a grave and dangerous breaking of the tree, or misuse of commit rights. So why didn’t I bring this up to devrel? Well, the main reason is that devrel nowadays, as far as I can tell, is exactly three people: Petteri, Denis and Jorge, and of the three the only one who’s for preventive suspension of commit rights is Denis (this was proven by the case of developer A above); one out of three does not really sound like much of a chance of improving the situation. And if, again as happened with developer A, DevRel then decided that the right action would be to issue a reprimand, that would amount to scolding the developer and asking him to work more with others… well, it wouldn’t change a thing.

The whole QA system has to change! We’ve got to write down guidelines, rules, and laws, and be conservative in applying them. You shouldn’t go around breaching them and then appealing when QA finds you out of line; you should talk with QA if you feel a rule is misapplied to your case in any way.

So here you go, in a nutshell, why my preservation instinct right now is telling me to flee. I’m not sure yet whether I’ll outright flee or just give it some time for the situation to be addressed and then decide. The reason is: I still like the Gentoo system, and since I rely on it for my work I cannot leave it alone; if I were to move to anything else I would have to spend (waste?) even more time fixing the same issues anyway, and I’d much rather get Gentoo working right. But I cannot do this alone, and I especially cannot do this if I have support from neither developers nor users. So please voice your concern.

If you feel like Gentoo needs better QA, if you feel like we shouldn’t be translating unstable into unusable, then please ask for it. I’m not saying that we should become stale like Debian stable, but if it takes a few months to get something straight, then it should take its time and not be forced through (that’s what the Ruby team has been doing all this time to work with Ruby 1.9 and Ruby EE and other implementations as well!). If you use Twitter, identi.ca, Digg, Reddit, Slashdot, whatever, get this post circulating. Maybe I’m subverting the process, but to quote BBC’s NewsQuiz, “Trial by media is the most efficient form of justice” (this was in reference to the British MP expenses scandal last year), and right now my only concern is effectiveness.

Upgrading Typo

It’s again that time of the year when I get tired and decide to update Typo; although this is probably one of the most invasive changes since I started using this software (it moved from Subversion to Git), it seems to have been the one with the fewest issues to fix. Although some were quite nasty.

The first problem is that the version you can find in git right now is broken and won’t start up… and to “blame” is our very own Hans (graaf)! I say “blame” because it’s not really his fault, and I just worked around it by removing the Dutch localisation, which I don’t need anyway. I had to fix up the theme to work with the new Typo code, but I also took the time to fix a few more obnoxious things in the theme, so that it now looks nicer too.

There were just two main issues with the update. The easy one is that users don’t get activated after migration, which means you cannot log in; one psql call later and I was back in. The other problem is that the generated Atom feed was invalid: to replace HTML entities like eacute for é and similar, it decoded all the entities, including the lt/gt used to avoid injecting tags into the posts. Luckily for me I had one such “fake tag” right in my previous post, so I noticed this problem right away; I hacked around it for now, and as soon as I have a little more time I’m going to fix it properly.
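
The proper fix will probably be along these lines (a sketch of the idea, not the actual Typo code): decode the named entities into UTF-8, but leave the markup-significant ones alone.

# Map named entities to UTF-8, but never touch the ones that carry markup
# meaning, otherwise a "fake tag" written as &lt;foo&gt; becomes a real tag.
KEEP_ENTITIES = %w[lt gt amp quot]

def decode_safe_entities(text, table)
  text.gsub(/&([a-zA-Z]+);/) do
    name = $1
    KEEP_ENTITIES.include?(name) ? $& : (table[name] || $&)
  end
end

table = { "eacute" => "é", "agrave" => "à" }   # abbreviated entity table
decode_safe_entities("caf&eacute; &lt;b&gt;", table)
# => "café &lt;b&gt;"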

I had to update the mod_security access restriction, since comments are now posted through a single /comments URI, which actually makes it much nicer. I’m going to update it and post it ASAP. The live search support (which I already removed from the template a few days ago) is now optional in Typo itself, which is good; I have to port the Google custom search code to the new sidebar plugin interface, to make it blend in better.

I’ll also need a couple more packages added to Portage for it to work, but that can happen later, not today, as I’m already a bit swamped. I like the new admin interface, although it increased the font size (though not as much as WordPress) and I hate huge fonts (Firefox does not seem to let me use smaller fonts; the rest of my system is set to 75dpi, which is fake but works fine, yet Firefox does not accept that).

I’m glad I’m not a DBA

Today, even though it’s New Year’s Eve, I’ve spent it working just like any other day, looking through the analysis log of my linking-collisions script to find some more crappy software in need of fixes. As it turns out, I found quite a bit of software, but I also confirmed to myself that I have crappy database skills.

The original output of the script, already taking quite a long time to produce, didn’t sort the symbols by name, but just by count, so as to show the symbols with the most collisions first and the ones related to just one or two files later. It also didn’t sort the names of the objects where the symbols could be found, which was quite an issue, as from time to time the ordering changed, so the lists of elements weren’t easy to compare between symbols.

Yesterday I added sorting to both fields so that I could have a more pleasant log to read, but it caused the script to slow down tremendously. At which point I noticed that maybe, just maybe, PostgreSQL wasn’t optimising my tables, even though I had created views in the hope of it being smart enough to use them as optimisation options. So I created two indexes, one on the names of the objects and one on the names of the symbols, with the default access method (btree).

The harvesting process now slowed down by a good 50%: instead of taking less than 40 minutes, it took about an hour. But then, when I launched the analysis script, it generated the whole 30MB log file in a matter of minutes rather than requiring hours; I had never been able to let the analysis script complete its work before, and now it did it in minutes.

I have no problem saying that my database skills suck, which is probably why I’m much more of a system developer than a webapp developer.

Now at least I won’t have many more doubts about adding a way to automatically expand “multimplementations”: with the speed it has now, I can easily get it to merge in the data from the third table without many issues. But still, seeing how pointless my SQL skills are, I’d like to ask for some help on how to deal with this.

Basically, I have a table with paths, each of which refers to a particular object, which I call a “multimplementation” (and which groups together all the symbols related to a particular library, ignoring things like ABI versioning and different sub-versions). For each multimplementation I have to get a descriptive name to report to users. When there is just one path linked to that object, that path should be used; when there are two paths, the name of the object plus the two paths should be used; for more than two paths, the object name and the first path should be used, with an ellipsis to indicate that there are more.
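
In plain Ruby, the rule I’m after is nothing more than this (a sketch to clarify what I mean; what I’m looking for is the equivalent expressed in SQL, ideally without one query per object):

# paths is the list of paths that point at a given multimplementation.
def display_name(object_name, paths)
  case paths.size
  when 1 then paths.first
  when 2 then "#{object_name} (#{paths[0]}, #{paths[1]})"
  else        "#{object_name} (#{paths.first}, ...)"
  end
end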

If you want to see the actual schema, you can find it on ruby-elf’s repository in the tools directory.

There are more changes to the database that I should make to make it much more feasible to connect the paths (and thus the objects) to package names, but at least now, given the speed it runs at, it seems feasible to run these checks on a more regular basis on the tinderbox. If only I could find an easy way to do incremental harvesting, I might even be able to run it on my actual system too.