Yes, I’m still fighting with Amazon’s EC2 service for the very same job, and I’m still ranty about it. Maybe I’m too old-school, but I find using the good old virtual servers is much much easier to deal with. It’s not that I cannot see the usefulness of the AWS approach (you can easily try to get something going without sustaining a huge initial investment of capital to get the virtual servers, and you can scale it further on in the working), but I think more than half the interface is just an afterthought, rather than an actual design.
The whole software support for AWS is a bit strange: the original tools, that are available in Portage, are written in Java for the big part, but they don’t seem to be actively versioned and properly released by Amazon themselves, so you actually have to download the tools, then check the version from the directory inside the tarball to know the stable download URL for them (to package them in Gentoo, that is). You can find code to manage AWS services in many languages, including Ruby, for various pieces of it, but you cannot easily find an alternative console if not the ElasticFox extension for Firefox, which I have to say makes me doubt a lot (my Firefox is already slow enough). On the other hand, I actually found some promising command-line utilities in Rudy (which I packaged in Gentoo with a not indifferent effort), but beside some incompatibility with the latest version of the amazon-ec2 gem (which I fixed myself), there are other troubles with it (like not being straightforward how to handle multiple AMIs for different roles, or being impossible to handle snapshot/custom AMI creation through just it). Luckily, the upstream maintainer seems to be around and quite responsive.
Speaking about the libraries, it seems like one of the problems with the various Ruby-based libraries is that one of the most commonly used libraries (RightScale’s right_aws gem) is no longer maintained, or at least upstream has gone missing, and that causes obvious stir in the community. There is a fork for it, that forks the HTTP client library as well (right_http_connection, becoming http_connection — interestingly enough for a single, one line change that I’ve simply patched in on the Gentoo package). The problem is that the fork got worse than the original gem for what packaging is concerned: not only the gem is not providing the documentation, Rakefile, tests and so on, but they are not even tagged in the git repository last I check. Alas.
Luckily, it seems like amazon-ec2 is much better at this job; not that it was pain-free, but even here upstream is available, and fast to release a newer version; the same goes for litc, and the dependencies of the above-mentioned Rudy (see also this blog post from a couple of days ago). This actually make it so that the patches I’m applying, and adding to Gentoo, are deleted or don’t even enter the tree to begin with, which is good for the users who have to sync to keep the size of Portage down to acceptable levels.
Now, back to the EC2 support proper; I already ranted before about the lack of Gentoo support; turns out that there is more support if you go over the American regions, rather than the European one. And at the same time, the European zone seems to have problems: I spent a few days wondering why right_aws failed (and I thought it was because of the bugs that they forked it in the first place), but at the end I had to decide that the problem was with AWS itself: from time to time, a batch of my request fall into oblivion, with errors ranging from “not authorized“ to “instance does not exist” (for something I’m still SSH’d into, by the way). At the end, I decided to move to a different region, US/East, which is where my current customer is doing their tests already.
Now this is not easy either since there is no way to simply ask Amazon to transfer a volume from a given region (or zone) and copy it to another in their own systems (you can use snapshot to recreate a volume within a region on different availability zones, but that’s another problem). The official documentation suggests you to use out-of-band transmission (which, for big volumes, becomes expensive), and in particular the use of sync. Now this wouldn’t have to be too difficult, their suggestion is also to use
rsync directly, which would be a good suggestion, if not for one particular. As far as I can tell, the only well-supported community distribution available, with a decently recent kernel (one that works with modern udev, for instance) is Ubuntu; in Ubuntu, you cannot access the root user directly as you all probably well know, and EC2 is no exception (indeed, the copiable command that they give you to connect to your instances is wrong for the Ubuntu case, they explicitly tell you to use the root user, when you have, instead, to use the ubuntu user, but I digress); this also means that you cannot use the root user as either origin or destination of an
rsync command (you can
sudo -i to get a root session from one or the other side, but not on both, and you need it on both to be able to
rsync over the privileged files); okay the solution is definitely easy to find, you just need to
tar up the tree you want to transfer, and then scp that over, but it really strikes odd to me that their suggested approach does not work with the only distribution that seems to be updated and supported on their platform.
Now, after the move to the US/East region, problems seems to have disappeared and all commands finally succeeded every time, yuppie! I finally was able to work properly on the code for my project, rather than having to fight with deployment problems (this is why my work is in development and not system administration); after such an ordeal, writing custom queries in PostgreSQL was definitely more fun (no Rails, no ActiveRecord, just pure good old PostgreSQL — okay I’m no DBA either, and sometimes I might have difficulties getting big queries to perform properly, as demonstrated by my work on the collision checker but some simpler and more rational scheme I can deal with pretty nicely). Until I had to make a change to the Gentoo image I was working with, and decided to shut it down, restart Ubuntu, and make the changes to create a new AMI; then hell broke loose.
Turns out that for whatever reason, for all the day yesterday (Wednesday 17th February), after starting Ubuntu instances, with both my usual keypair and a couple of newly-created ones (to exclude a problem with my local setup), the instance would refuse SSH access, claiming “too many authentication failures”. Not sure on the cause, I’ll have to try again tonight and hope that it works as I’m late on delivery already. Interestingly enough, the system log (which only appears one out of ten requests for it from the Amazon console) shows everything as okay, with the sole exception of the Plymouth software that crashes with segmentation fault (code 11) just after the kernel loaded.
So all in all, I think that as soon as this project is completed, and with the exception of eventual future work on this, I will not turn back to Amazon’s EC2 anytime soon; I’ll keep getting normal vservers, with proper Gentoo on them, without hourly fees, with permanent storage and so on so forth (I’ll stick with my current provider as well, even though I’m considering adding a fallback mirror somewhere else to be on the safe side; while my blog’s not that interesting, I have a couple of sites on the vserver that might require me to have higher uptime, but that’s a completely off-topic matter right now).