The end of an era, the end of the tinderbox

I’m partly sad, but for the most part this is a weight off my shoulders, so I can’t say I’m not at least in part relieved, even though the context in which this is happening is not exactly what I expected.

I turned off the Gentoo tinderbox, never to come back. The S3 storage of logs is still running, but I’ve asked Ian to see if he can attach everything at his pace, so I can turn off the account and be done with it.

Why did this happen? Well, it’s a long story. I had already stopped running it for a few months because I got tired of Mike behaving like a child — as I already reported in 2012 — closing my bugs because the logs are linked (from S3) rather than attached. I already made my position clear that it’s a silly distinction, as the logs will not vanish into thin air (indeed I’ll keep the S3 bucket running until they are all attached to Bugzilla), but as he keeps insisting that it’s “trivial” to change the behaviour of the whole pipeline, I decided to give up.

Yes, it’s only one developer, and yes, lots of other developers took my side (thanks guys!), but it’s still aggravating to have somebody who can do whatever he likes without reporting to anybody, ignoring Council resolutions, QA (when I was the lead) and essentially using Gentoo as his personal playground. And the fact that only two people (Michał and Julian) have been pushing for a proper resolution is a bit disappointing.

I know it might feel like I’m taking my toys and going home — well, that’s what I’m doing. The tinderbox has been a drain on my time (a little) and my money (quite a bit more), but those I was willing to part with — having my motivation drained by assholes in the project was not in the plans.

In the past six years that I’ve been working on this particular project, things evolved:

  • Originally, it was a simple chroot with a looping emerge, inspected with grep and Emacs, running on my desktop and intended to catch --as-needed failures. It went through lots of disks, and got me off XFS for good due to kernel panics.
  • It was moved to LXC, which is why the package entered the Gentoo tree, together with the OpenRC support and the first few crude hacks.
  • When I started spending time in Los Angeles for a customer, Yamato under my desk got replaced with Excelsior, which was crowdfunded and hosted, for two years straight, by my customer at the time.
  • This is where the rewrite happened: from attaching logs (which I could do before with more or less ease, thanks to NFS) to storing them away and linking them instead. This had to do mostly with the ability to remote-manage the tinderbox.
  • This year, since I no longer work for the company in Los Angeles, and instead work in Dublin for a completely different company, I decided Excelsior was better off in a personal space, and rented a full 42-unit cabinet with Hurricane Electric in Fremont, where the server is still running as I type this.

You can see that it’s not that I’m trying to avoid spending time engineering solutions. It’s just that I feel that what Mike is asking is unreasonable, and the way he’s asking it makes it unbearable. Especially when he pretends to care about my expenses — as I noted in the previously linked post, S3 is dirt cheap, and indeed it now comes down to $1/month given to Amazon for log storage and access, compared to $600/month to rent the cabinet at Hurricane.

Yes, it’s true that the server is not doing only tinderboxing – it is also running some fate instances, and I have been using it as a development server for my own projects, mostly open-source ones – but tinderboxing is the original use for it, and if it wasn’t for that I wouldn’t be paying so much to rent a cabinet; I’d be renting a single dedicated server off, say, Hetzner.

So here we go: the end of the era of my tinderbox. Patrick and Michael are still continuing their efforts, so it’s not like Gentoo is left without integration testing, but I’m afraid it’ll be harder for at least some of the maintainers who leveraged the tinderbox heavily in the past. My contract with Hurricane expires in April; at that point I’ll get the hardware out of the cabinet and decide what to do with it — it’s possible I’ll donate the server (minus harddrives) to the Gentoo Foundation or someone else who can use it.

My involvement in Gentoo might also suffer from this; I hopefully will be dropping one of the servers I maintain off the net pretty soon, which will be one less system to build packages for, but I still have a few to take care of. For the moment I’m taking a break: I’ll soon send an email that it’s open season on my packages; I locked my bugzilla account already to avoid providing harsher responses in the bug linked at the top of this post.

Tinderbox and expenses

I’ve promised some insight into how much running the tinderbox actually cost me. And since today marks two months from Google AdSense’s crazy blacklisting of my website, I guess it’s as good a time as any.

So let’s start with the obvious first expense: the hardware itself. My original Tinderbox was running on the box I called Yamato, which cost me some €1700 and change, without the harddrives; this was back in 2008 — and about half the cost was paid with donations from users. Over time, Yamato had to have its disks replaced a couple of times (and sometimes the cost came out of donations). That computer has been used for other purposes, including as my primary desktop for a long time, so I can’t really complain about the parts that I had to pay for myself. Other devices, and connectivity, and all those things, ended up being shared between my tinderbox efforts and my freelancing job, so I also don’t complain about those in the least.

The new Tinderbox host is Excelsior, which was bought with the Pledgie that left me paying only some $1200 out of my own pocket, the rest coming in from the contributors. The space, power and bandwidth have been offered by my employer, which solved quite a few problems. Since I no longer have to pay for the power, and last time I went back to Italy (in June) I turned off, and got rid of, most of my hardware (the router was already having some trouble; Yamato’s motherboard was having trouble anyway, so I saved the harddrive to decide what to do with it, and sold the NAS to a friend of mine), I can now assess how much I was spending on the power bill.

My usual power bill was somewhere around €270 — which obviously includes all the usual house power consumption as well as my hardware and, due to the way power is billed in Italy, an advance on the next bill. The bill for the months between July and September, the first where I was fully out of my house, was for -€67 — and no, it’s not a typo, it was a negative bill! Calculator in hand, the actual difference between the previous bills and the new one is around €50 a month — assuming that only a third of that was for the tinderbox hardware, that makes it around €17 per month spent on the power bill. It’s not much, but it adds up. Connectivity — that’s hard to assess, so I’d rather not even go there.

With the current setup, there is of course one expense that wasn’t there before: AWS. The logs that the tinderbox generates are stored on S3, since they need to be accessible, and there are lots of them. And one of the reasons why Mike is behaving like a child about me linking the build logs instead of attaching them is that he expects me to delete them because they are too expensive to keep indefinitely. So, how much does the S3 storage cost me? Right now, a whopping $0.90 a month. Yes, you got that right: it costs me less than one dollar a month for all that storage. I guess the reason is that the logs are not stored for high reliability or high-speed access, and they are highly compressible (even though they are not compressed by default).

You can probably guess at this point that I’m not going to clear out the logs from AWS for a very long time. Although I would like for some logs not to be so big for nothing — like the sdlmame one, which used to pass the -v switch to GCC, causing every compiler invocation to print a long dump of internal data that is rarely useful in a default log output.

Luckily for me (and for the users relying on the tinderbox output!) those expenses are well covered by the Flattr revenue from my blog’s posts — and thanks to Socialvest I no longer have to wonder whether I should keep the money or use it to flattr others (I currently have over €100 ready for the next six or seven months’ worth of flattrs). Before this, between my freelancing jobs, Flattr, and the ads on the blog, I was also able to cover at least the cost of the server (and barely the cost of the domains — but that’s partly my fault for having… a number of them).

Unfortunately, as I said at the top of the post, there are no longer ads served by Google on my blog. Why? Well, a month and a half ago I received a complaint from Google, saying that one post of mine, in which I namechecked a famous adult website in the context of a then-recent perceived security issue, was adult material, and that it goes against the AdSense policies to have ads served on a website with adult content. I would still argue that just namechecking a website shouldn’t be considered adult content, but while I did submit an appeal to Google, a month and a half later I have no response at hand. They didn’t blacklist the whole domain though, only my blog, so the ads are still shown on Autotools Mythbuster (which I plan to resume working on almost full-time pretty soon), but the result is bleak: I went down from €12–€16 a month to a low €2 a month, which is no longer able to cover the server expense by itself.

This does not mean that anything will change in the future, immediate or not. This blog has more value to me than the money I can get back from it, as it’s a way for me to showcase my abilities and, to a point, get employment — but you can understand that it still upsets me a liiiittle bit, the way they handled that particular issue.

Tinderbox logging away!

So, after spending one full day alone at my sister’s yesterday, I finally finished at least the part of the Tinderbox log analysis code that takes care of gathering the logs and highlighting the bad stuff in them.

The code is all available and you can easily improve upon it if you want; in hindsight, I should probably have put everything together in a single git repository with the tinderboxing script, but that can be arranged at a later point, hopefully.

What is interesting though is discussing the design, because by all means it’s not a simple one, and you can be deceived into thinking I was out of my mind when I wrote it.

First of all, there is the fact that the analysis itself is offloaded to a different system; in this case, a different container. The reason for this is simply the reliability of the tinderbox scripts themselves. Due to the way it works, even system-level software can easily break during an upgrade, which is one of the reasons why it’s not totally easy to automate the process. Because of this, I’m not interested in adding either analysis logic or complex “nodes” to the tinderbox host. My solution has been relatively easy: I just rely on tar and nc6 — I would have loved using just busybox for the whole of it, but the busybox implementation of netcat does not come with the -q option, which is required to complete the disconnection once standard input is terminated.
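The nice thing about this approach is that a tar stream already is a bare filename/content protocol, so the receiving end needs no custom framing. Here is a minimal sketch using Ruby’s bundled TarWriter/TarReader (the real receiver reads from a socket fed by nc6 rather than a StringIO, and the package name below is made up for illustration):

```ruby
require 'rubygems/package'
require 'stringio'

# The tinderbox end is just `tar | nc6`; on the analysis end the tar
# stream is a trivial filename/content protocol. Here the "network"
# is a StringIO, but the reader code is the same you would point at
# a socket.
buffer = StringIO.new

writer = Gem::Package::TarWriter.new(buffer)
log = "checking for gcc... yes\n"
writer.add_file_simple('dev-libs/foo-1.0.log', 0o644, log.bytesize) do |f|
  f.write(log)
end
writer.close

buffer.rewind

# Receiving side: walk the stream, collecting (filename, content) pairs.
logs = {}
reader = Gem::Package::TarReader.new(buffer)
reader.each { |entry| logs[entry.full_name] = entry.read }
```

Each entry can then be handed straight to the analysis code as it arrives, without waiting for the whole archive.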

Using tar gives me a very bare protocol I can use to ship filename/content pairs, which can then be analysed on the other side with the Ruby script in the repository linked at the top of the post; this script uses archive-tar-minitar, with a specially patched version in Gentoo, as otherwise it wouldn’t be able to access “streamed” archives — I’m tempted to fork it and release a version 0.6 with a few more fixes and more modern code, but I just don’t have the time right now; if you are interested, ping me.

One important thing to note here is that the script uses Amazon’s AWS, in particular S3 and SimpleDB. This might not be obvious, as the system has enough space to store the log files for a very long time. So why did I do it that way? While storage abounds, Excelsior resides in the same network as my employer’s production servers (well, on the same pipe, not on the same trusted network, of course!), so, to avoid swamping it too much, I don’t want to give anybody access to the logs on the system itself. Using S3 should be cheap enough that I can keep them around for a very long time!

Originally I planned on having the script called one-off by xinetd, spawning multiple processes to avoid threading (which is not Ruby’s forte), but the time taken to initialise AWS wasn’t worth it, so I wrote it as it is now. Yes, there is one bad thing: the script expects Ruby 1.9 and won’t work with Ruby 1.8. Why? Well, mainly because this way it was easier to write, but then again, since I’m going to need concurrent processing at some point, which means making the script multithreaded, Ruby 1.9 is a good choice. After all, I can decide what I run, no?

After the log is received, the file is split line-by-line and a regexp is applied to each line – an extra thanks to blackace and Joachim for helping with a human OCR over a very old, low-res screenshot of my emacs window with the tinderbox logs grep command – and if there are any matches, the lines are marked in red. This creates very big HTML files, obviously, but they should be fine. If they start piling up, I’ll see about compressing them before storing them on Amazon.
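As a sketch of that per-line step — the patterns here are a made-up subset for illustration, not the actual expressions recovered from the screenshot — the logic looks roughly like this:

```ruby
require 'cgi'

# Illustrative subset of patterns that mark a build-log line as bad.
ERROR_PATTERN = Regexp.union(
  /^ \* ERROR:/,
  /implicit declaration of function/,
  /undefined reference to/
)

# Turn a raw build log into an HTML fragment, marking matching lines red.
def highlight_log(log_text)
  log_text.each_line.map do |line|
    escaped = CGI.escapeHTML(line.chomp)
    if line =~ ERROR_PATTERN
      %(<span style="color: red">#{escaped}</span><br/>)
    else
      "#{escaped}<br/>"
    end
  end.join("\n")
end
```

The output is one `<br/>`-terminated HTML line per log line, which is what makes the resulting files large but trivially scannable in a browser.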

The use of SimpleDB is simply because I didn’t want to set up two different connections: since all AWS services use the same login, I only need one for both the storage and the database. While SimpleDB’s “eventual consistency” makes it more of a toy than a reliable database, I don’t really care much; the main issue is with concurrent requests, and which one is served first makes no difference to me, as I only have to refresh my window to fetch a new batch of logs to scan.

In the database I’m storing very few attributes of the files: the link to the file itself, the package it belongs to, the date of the log, and how many matches there have been. My intention is to extend this to show me a legend of what happened: did it fail testing? did the ebuild fail? are they simply warnings? For now I went with the simplest options though.

To see what’s going on, though, I wrote a completely different script. Sinatra-based, it only provides a single entrypoint on localhost, and gives you the last hundred entries in SimpleDB that have matches. I’m going to try making this more extensible in the future as well.

One thing I skipped over in all this: to make it easier to apply this to different systems, I’m organising the logs by hostname, simply by checking where the connection is coming from (over IPv6, I have access to the reverse DNS for them). This is why I want to introduce threaded responses: quite soon, Excelsior will run some other tinderboxes (I’m not yet sure whether to use AMD64 Hardened or x86 Hardened — x32 is also waiting, to work as a third arch), which means that the logs will be merged and “somebody” will have to sift through three copies of them. At least with this system it’s feasible.

Anyway now I guess I’ll sign off, go watch something on Sky, which I’ll probably miss for a couple of weeks when I come back to the US, just the time for me to find a place and get some decent Internet.

Amazon EC2 and old concepts

On Friday I updated my Autotools Mythbuster guide to add portability notes for 2.68 (all the releases between 2.65 and 2.67 have enough regressions to make them a very bad choice for generic use — for some of those we’ve applied patches that make projects build nonetheless, but those releases should really just disappear from the face of the Earth). When I did so, I announced the change on my stream and then looked at the log, for a personal experiment of mine.

In a matter of a couple of hours, I could see a number of bots coming my way; some declared themselves outright (such as StatusNet, which checked the link to produce the shortened version), while others tried more or less sophisticated ways to pass themselves off as something else. On the other hand, it is important to note that many times when a bot declares itself to be something like a browser, it’s simply to get served what the browser would see, as browser-specific hacks are still way too common, but that’s a digression I don’t care about here.

This little experiment of mine was actually aimed at refining my ModSecurity ruleset since I had some extra free time; the results of it are actually already available on the GitHub repository in form of updated blacklists and improved rules. But it made me think about a few more complex problems.

Amazon’s “Elastic Compute Cloud” (or EC2) is an interesting idea to make the best use of all the processing power of modern server hardware; it makes a phrase from a colleague of mine last year (“Recently we faced the return of clustered computing under the new brand of cloud computing, we faced the return of time sharing systems under the software as a service paradigm […]”) sound even more true when you think of them introducing a “t1.micro” size for EBS-backed instances, for non-CPU-hungry tasks that can run with minimal CPU but need more storage space.

But at the same time, the very design of the EC2 system gets troublesome in many ways; earlier this year I encountered troubles with hostnames when calling back between different EC2 instances, which ended up being resolved by using a dynamic hostname, like we were all used to at the time of dynamic-IP connections such as home ADSL (which I was on, basically, until a couple of years ago). A very old technique, almost forgotten by many people, but pretty much necessary here.

It’s not the only thing that EC2 brought back from the time of ADSL though; any service based on it will lack proper FCrDNS verification, which is very important to make sure that a bot request hasn’t been forged (that is, until somebody creates a RobotKeys standard similar to the DomainKeys standard), leaving it possible for non-legit bots to pass as legit ones, unless you can find a way to discern between the two with deep inspection of the requests. At the same time, it makes it very easy to pass for anything at all, since the User-Agent is all you have to judge who is making a request, as the IP addresses are dynamic and variable.
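For reference, forward-confirmed reverse DNS is a simple idea: look up the PTR record for the IP, then check that the resulting name resolves back to the same address. A minimal sketch in Ruby — the resolver is injectable here purely so the logic can be exercised without network access; a real deployment would just use the default `Resolv::DNS`:

```ruby
require 'resolv'

# Forward-confirmed reverse DNS: an IP passes only if its PTR name
# resolves back to the same IP. Legitimate crawlers publish matching
# records; a bot on a borrowed EC2 address cannot.
def fcrdns_ok?(ip, resolver: Resolv::DNS.new)
  ptr = resolver.getname(ip).to_s
  forward = resolver.getaddresses(ptr).map(&:to_s)
  forward.include?(ip)
rescue Resolv::ResolvError
  false # no PTR record, or the name does not resolve: fail closed
end
```

On EC2 this check is useless for identifying bots, which is exactly the problem: the PTR records point at generic `amazonaws.com` names that say nothing about who is actually running the instance.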

This situation led to an obvious conclusion in the area of DNSBLs (DNS-based black lists): the whole AWS netblock is marked down as a spam source and is thus mostly unable to send email (or, in the case of my blog, to post comments). Unfortunately this has a huge disadvantage: Amazon’s own internal network faces the Internet from the same netblock, which means that Amazon employees can’t post comments on my blog either.

But the problem doesn’t stop there. As it was, my ruleset cached the result of robot analysis by IP for a week. This covers the situation pretty nicely for most bots that are hosted on a “classic” system, but for those running on Amazon AWS the situation is quite different: the same IP address can change “owner” in a matter of minutes, leading to false positives as well as using up an enormous number of cache entries. To work around this problem, instead of hardcoding the expiration date of any given IP-bound test, I use a transaction variable, which defaults to the previous week, but gets changed to an hour in the case of AWS.
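The idea boils down to picking the cache lifetime per source address. A sketch of that decision in Ruby — in the real setup this is a ModSecurity transaction variable rather than Ruby code, and the netblocks below are placeholders, not Amazon’s actual published ranges:

```ruby
require 'ipaddr'

# Placeholder netblocks standing in for Amazon's EC2 ranges.
AWS_RANGES = %w[50.16.0.0/14 184.72.0.0/15].map { |r| IPAddr.new(r) }

ONE_HOUR = 3600
ONE_WEEK = 7 * 24 * 3600

# How long may a per-IP robot-analysis verdict be cached? A week for
# stable hosts, only an hour for EC2 addresses that change "owner"
# in a matter of minutes.
def cache_ttl(ip)
  addr = IPAddr.new(ip)
  AWS_RANGES.any? { |range| range.include?(addr) } ? ONE_HOUR : ONE_WEEK
end
```

The same shape works for any hosting provider with volatile address assignment; only the range list changes.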

Unfortunately, it seems like EC2 is bringing us back in time, to the days of “real-time block lists” that need to list individual IPs rather than whole netblocks. What’s next? Am I going to see “under construction” signs on websites again?

Ranting on about EC2

Yes, I’m still fighting with Amazon’s EC2 service for the very same job, and I’m still ranty about it. Maybe I’m too old-school, but I find the good old virtual servers much, much easier to deal with. It’s not that I cannot see the usefulness of the AWS approach (you can easily get something going without a huge initial investment of capital in virtual servers, and you can scale it up further as you go), but I think more than half the interface is just an afterthought, rather than an actual design.

The whole software support for AWS is a bit strange: the original tools, available in Portage, are written in Java for the most part, but they don’t seem to be actively versioned and properly released by Amazon themselves, so you actually have to download the tools, then check the version from the directory inside the tarball to know the stable download URL for them (to package them in Gentoo, that is). You can find code to manage AWS services in many languages, including Ruby, for various pieces of it, but you cannot easily find an alternative console other than the ElasticFox extension for Firefox, which I have to say makes me doubt a lot (my Firefox is already slow enough). On the other hand, I actually found some promising command-line utilities in Rudy (which I packaged in Gentoo with no small effort), but besides some incompatibility with the latest version of the amazon-ec2 gem (which I fixed myself), there are other troubles with it (like it not being straightforward to handle multiple AMIs for different roles, or it being impossible to handle snapshot/custom AMI creation through it alone). Luckily, the upstream maintainer seems to be around and quite responsive.

Speaking of the libraries, it seems like one of the problems with the various Ruby-based libraries is that one of the most commonly used ones (RightScale’s right_aws gem) is no longer maintained, or at least upstream has gone missing, and that causes an obvious stir in the community. There is a fork of it, which forks the HTTP client library as well (right_http_connection, becoming http_connection — interestingly enough, for a single one-line change that I’ve simply patched in on the Gentoo package). The problem is that the fork got worse than the original gem as far as packaging is concerned: not only does the gem not provide the documentation, Rakefile, tests and so on, but releases are not even tagged in the git repository, last I checked. Alas.

Luckily, it seems like amazon-ec2 is much better on this front; not that it was pain-free, but even here upstream is available and fast to release newer versions; the same goes for litc, and the dependencies of the above-mentioned Rudy (see also this blog post from a couple of days ago). This actually makes it so that the patches I’m applying, and adding to Gentoo, get deleted or don’t even enter the tree to begin with, which is good for the users who have to sync, as it keeps the size of Portage down to acceptable levels.

Now, back to the EC2 support proper; I already ranted before about the lack of Gentoo support; it turns out that there is more support if you go with the American regions rather than the European one. And at the same time, the European zone seems to have problems: I spent a few days wondering why right_aws failed (and I thought it was because of the bugs for which they forked it in the first place), but in the end I had to conclude that the problem was with AWS itself: from time to time, a batch of my requests falls into oblivion, with errors ranging from “not authorized” to “instance does not exist” (for something I’m still SSH’d into, by the way). In the end, I decided to move to a different region, US/East, which is where my current customer is doing their tests already.

Now this is not easy either, since there is no way to simply ask Amazon to take a volume from a given region (or zone) and copy it to another within their own systems (you can use snapshots to recreate a volume within a region on a different availability zone, but that’s another problem). The official documentation suggests you use out-of-band transmission (which, for big volumes, becomes expensive), and in particular the use of rsync. Now, this wouldn’t have to be too difficult; their suggestion to use rsync directly would be a good one, if not for one detail. As far as I can tell, the only well-supported community distribution available with a decently recent kernel (one that works with modern udev, for instance) is Ubuntu; in Ubuntu, you cannot access the root user directly, as you all probably well know, and EC2 is no exception (indeed, the copyable command they give you to connect to your instances is wrong for the Ubuntu case: they explicitly tell you to use the root user, when you have, instead, to use the ubuntu user, but I digress). This also means that you cannot use the root user as either origin or destination of an rsync command (you can sudo -i to get a root session on one side or the other, but not on both, and you need it on both to be able to rsync the privileged files over). Okay, the solution is easy to find: you just need to tar up the tree you want to transfer, and then scp it over. But it really strikes me as odd that their suggested approach does not work with the only distribution that seems to be updated and supported on their platform.

Now, after the move to the US/East region, the problems seem to have disappeared and all commands finally succeed every time — yuppie! I was finally able to work properly on the code for my project, rather than having to fight with deployment problems (this is why my work is in development and not system administration); after such an ordeal, writing custom queries in PostgreSQL was definitely more fun (no Rails, no ActiveRecord, just good old pure PostgreSQL — okay, I’m no DBA either, and sometimes I have difficulties getting big queries to perform properly, as demonstrated by my work on the collision checker, but a simpler and more rational schema I can deal with pretty nicely). Until I had to make a change to the Gentoo image I was working with, and decided to shut it down, restart Ubuntu, and make the changes to create a new AMI; then hell broke loose.

Turns out that for whatever reason, for all of yesterday (Wednesday 17th February), after starting Ubuntu instances, with both my usual keypair and a couple of newly-created ones (to exclude a problem with my local setup), the instance would refuse SSH access, claiming “too many authentication failures”. Not sure of the cause; I’ll have to try again tonight and hope that it works, as I’m late on delivery already. Interestingly enough, the system log (which only appears for one out of ten requests from the Amazon console) shows everything as okay, with the sole exception of the Plymouth software, which crashes with a segmentation fault (code 11) just after the kernel loads.

So all in all, I think that as soon as this project is completed, and with the exception of possible future work on it, I will not turn back to Amazon’s EC2 anytime soon; I’ll keep getting normal vservers, with proper Gentoo on them, without hourly fees, with permanent storage and so on and so forth (I’ll stick with my current provider as well, even though I’m considering adding a fallback mirror somewhere else to be on the safe side; while my blog’s not that interesting, I have a couple of sites on the vserver that might require me to have higher uptime, but that’s a completely off-topic matter right now).

My first experiences with Amazon EC2

It really shouldn’t be a surprise for those reading this post that I’ve been tinkering with Amazon EC2 in the past few days: you can find that out by either looking at my stream or at my commit feed, and noticing how I ranted about EC2 and bumped the tools’ packages in Gentoo.

As a first experience with EC2, I have to say it did not really turn out very nice… Now, the whole idea of EC2 doesn’t look half bad, and on the whole I think the implementation is not too bad either. What is a problem is the Gentoo EC2 guest support: while Amazon “boasts” support for Gentoo as a guest operating system, there is no real support for it out of the box.

*Incidentally, there is another problem: the AWS Web Console that Amazon makes available killed my Firefox 3.6 (ground it to a halt). I ended up installing Chromium even though it stinks (it stinks less than Opera, at least). It seems much faster, but it’s still lacking things like the Delicious sidebar. Seems like my previous post wasn’t totally far off. Sure, there are extensions now, but as far as I can tell, the only one available for Delicious does not let you use Delicious as if it were your bookmarks.*

The first problem you have to face is finding an image (AMI) to use for Gentoo… I could only find one (at least in the European availability zone), which is… a stage3. Yes, a standard Gentoo stage3: without configured network, without SSH, without passwords, … Of course it won’t start. I spent the best part of three hours last night trying to get it to work, and in the end I was told that the only way to work around that is to install from Ubuntu, as if it were a live CD installing Gentoo on a real system. Fun.

So start up Ubuntu, create an EBS (stable storage) volume for the Gentoo install, install it as if it were a normal chroot, create the snapshot and… here is one strange behaviour of Amazon: when you connect a new EBS volume to an instance, it is presented as a whole block device (say, sdb). When you use a snapshot of that volume to register a machine (AMI), it becomes a partition (sda1). If, like me, you didn’t consider this when setting it up for install, and partitioned it normally, you’ll end up with an unbootable snapshot. Fun ensues.

By the way, to be able to register the machine, you have to do that through recent API tools, more recent than those that were available in Portage until today. Hoping that Caleb won’t mind, I bumped them, and also made a couple of changes to the new API/AMI tools ebuilds: they no longer require you to re-source the environment every time there’s an upgrade, and they avoid polluting /usr/bin with a pile of symlinks.

So you finally complete the install, re-create the AMI, start an instance and… how the heck is it supposed to know your public key? That’s definitely a good question: right now there is no way in Gentoo for the settings coming from Amazon to be picked up. It’s not difficult, and it seems to be documented as well, but as it is, it’s not possible. As I don’t currently need the ability to generate base images, I haven’t gone further pursuing that objective; on the other hand, I have some code that I might commit soonish.
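The documented mechanism is the instance metadata service: the public half of the keypair selected at launch is exposed over a link-local HTTP endpoint, and an init script just has to append it to authorized_keys at boot. A hedged sketch of what such a hook could do — `install_ec2_key` and its `fetch:` parameter are names I made up for illustration, only the metadata URL itself is Amazon’s:

```ruby
require 'net/http'
require 'uri'

# Well-known EC2 instance metadata endpoint; the keypair chosen at
# launch is published here for the guest to pick up at boot.
METADATA_KEY_URL =
  'http://169.254.169.254/latest/meta-data/public-keys/0/openssh-key'

# Hypothetical init-script helper: append the instance's public key
# to authorized_keys, skipping it if it is already present.
def install_ec2_key(authorized_keys_path,
                    fetch: -> { Net::HTTP.get(URI(METADATA_KEY_URL)) })
  key = fetch.call.strip
  existing = File.exist?(authorized_keys_path) ? File.read(authorized_keys_path) : ''
  return if existing.include?(key)
  File.open(authorized_keys_path, 'a') { |f| f.puts key }
end
```

Injecting the fetch step keeps the logic testable off-instance; on a real image the default lambda would run against the metadata service at first boot.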

Anyway, if you’re interested in having a better Gentoo experience on EC2, I might start looking into it more in the future, as part of my general involvement, so let your voice be heard!