The importance of reliability

For the past seven years I worked at Google as a Site Reliability Engineer. I'm actually leaving the company; indeed, I'm writing this during my notice period. I'm currently scheduled to join Facebook as a Production Engineer at the end of May (unless COVID-19 makes things even more complicated). Both roles relate to the reliability of services (Google even put it in the name), so you could say I have more than a passing idea of what is involved in maintaining, if not writing, reliable software and services.

I had to learn to be an SRE: this was my first 9-to-5 job (well, not really 9-to-5, given that SREs tend to keep very flexible and very strange hours), and I hadn't worked at such a scale before. And as I wrote before, the job got me used to expecting certain things. But it also made me realise how important it is for services and systems to be reliable, as well as secure, and just how unevenly that reliability is distributed out there.

During my tenure at Google, I have been on-call for many different services. Pretty much all of them were business critical in one way or another, some much more than others. But none of them were critical to society: I never joined the Google Cloud teams, any of the communication teams, or the Maps teams. I did spend time on the Search team, but while Search is definitely important to the business, I think society would sooner do without search results than without a way to contact their loved ones. But that's just my personal opinion.

The current huge jump in people working from home due to COVID-19 concerns has clearly shown how much more critical to society some online services have become, services that even ten years ago wouldn't have been considered that important: Hangouts, Meet, Zoom, Messenger, WhatsApp, and the list goes on. Video calls appear to be the only way to get in touch with our loved ones right now, as well as, for many, the only way to work. Thankfully, most of these services are provided by companies big enough to afford reliability in one form or another.

But at least in the UK, this has also shown how many other services are clearly critical to society, yet not provided by companies that can afford reliability. Online grocery shopping became the thing to do nearly overnight. Ocado, possibly the biggest grocery delivery company, came under so much pressure that they had to scramble, first introducing a “virtual queue” system, and then eventually taking down the whole website. As I type this, their front page informs you that login is only available to those who already have a slot booked for this weekend; no new slots can be booked.

In a similar fashion, online retailers, GP surgeries' online systems, online prescription services, and banks also appeared to be smothered in requests. I would be surprised if libraries, bookstores, and the websites of restaurants that don't rely on the big delivery companies weren't also affected.

And that made me sad, and at least in part made me feel ashamed of myself. You see, while I was looking for a new job, I also interviewed at another place: not a big multinational company, but a smaller one, a utility. And while the offer was very appealing, it was also a more challenging role, and I decided to pass on it. I'm not saying I would have made more of a difference for them than any other “trained” SRE, but I do think that a lot of these “smaller” players need their fair dose of reliability.

The problem is that there's a mixture of different attitudes, and actual costs, related to doing reliability the way Google and the other “bigs” do it. In the case of Google, more often than not the answer to something not working very well is to throw more resources (CPU, memory, storage) at it. That's not something you can do quickly when your service runs “on premise” (that is, in your own datacenter cabinet), and not something you can do cheaply when you run on someone else's cloud solution.

The thing is, Cloud is not just someone else's computer. It's a lot of computers, and it adds a lot of flexibility. It can even be cheaper than running your own server, sometimes. But it's also a risk, because if you don't know when to say “enough”, you end up with budget-wrecking bills. Or sometimes with a problem “downstream”. Take Ocado: the likelihood is that it wasn't the website that was being overloaded, it was the fulfilment. Indeed, the virtual queue approach was awesome: it limited the whole human interaction, not just the browser requests. And indeed, the queue worked fine (unlike, say, the CCC ticket queue), and the website didn't look overloaded at all.

But saying that on-premise equipment does not scale is not trying to market cloud solutions; it's admitting the truth: if you start getting that many requests at short notice, you can't go out, buy, image, and set up another four or five machines to serve them. But you can tell Google Cloud, Amazon, or Azure to go and triple the amount of resources available. And that might or might not make things better for you.

It's a tradeoff, and not one I have answers for. I can't say I have experience managing this tradeoff, either: all the teams I worked on had nearly blank cheques for internal resources (not quite, but nearly), and while resource saving was and is a thing, it never becomes a real dollar amount that you, as an SRE, end up dealing with. Other companies, particularly smaller ones, need to pay a lot of attention to exactly that.

From my point of view, what I can do is try to be more open in discussing design decisions in my software, particularly when I think it's my experience talking. I still need to work actively on Tanuga, and I am even considering making a YouTube video of me discussing the way I plan to implement it, as if I were discussing it during a whiteboard design interview (I have quite a bit of experience with those, this year).

Blog Redirects, Azure Style

Last year, I set up an AppEngine app to redirect the old blog's URLs to the WordPress install. It's a relatively simple Flask web application, although it turned out to be around 700 lines of code (quite a bit just to serve redirects). While it ran fine for over a year on Google Cloud without me touching anything, fitting into the free tier, I had to move it as part of my divestment from GSuite (which is only vaguely linked to me leaving Google).

I could have just migrated the app to a new consumer account for AppEngine, but I decided to try something different, to avoid the bubble, and to compare other offerings. I decided to try Azure, Microsoft's cloud offering. First impressions were mixed.

The good thing about the Flask app I used for redirection being that simple is that nothing ties it to any one provider: the only things you need are a Python environment and the ability to install the requests module. For the same codebase to work on both AppEngine and Azure, though, one simple change is needed. Both providers appear to rely on Gunicorn, but AppEngine looks for an object called app in the main module, while Azure looks for it in the application module. This is trivially solved by defining the whole Flask app inside application.py and having the following content in main.py (the command-line support is for my own convenience):

#!/usr/bin/env python3

import argparse

from application import app


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--listen_host', action='store', type=str, default='localhost',
        help='Host to listen on.')
    parser.add_argument(
        '--port', action='store', type=int, default=8080,
        help='Port to listen on.')

    args = parser.parse_args()

    app.run(host=args.listen_host, port=args.port, debug=True)

The next problem I encountered was with the deployment. While there are plenty of guides out there using different builders to set up deployment on Azure, I was lazy and went straight for the most clicky one, which uses GitHub Actions to deploy from a (private) GitHub repository straight into Azure, without having to install any command-line tools (sweet!). Unfortunately, I hit a snag in the form of what I think is a bug in the Azure GitHub Actions template.

You see, the generated workflow for the deployment to Azure pretty much zips up the content of the repository, after creating a virtualenv directory into which the defined requirements are installed. But while the workflow creates the virtualenv in a directory called env, the default startup script on Azure looks for it in a directory called antenv. So for me the app was failing to start until I changed the workflow to use the latter:

    - name: Install Python dependencies
      run: |
        python3 -m venv antenv
        source antenv/bin/activate
        pip install -r requirements.txt
    - name: Zip the application files
      run: zip -r myapp.zip .

Once that problem was solved, the next issue was figuring out how to set up the app on its original domain and have it serve TLS connections as well. This turned out to be a bit more complicated than expected, because I had set up CAA records in my DNS configuration to only allow Let's Encrypt, while Microsoft uses DigiCert to provide the (short-lived) certificates; until I removed that record, the certificate couldn't be issued (oops).
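
For reference, in zone-file terms the offending record looked roughly like this (domain replaced, obviously); the alternative to removing it would have been to also allow DigiCert to issue:

; What I had: only Let's Encrypt may issue certificates for the domain.
example.com.  IN  CAA  0 issue "letsencrypt.org"

; What would also have worked: allow DigiCert as well.
example.com.  IN  CAA  0 issue "letsencrypt.org"
example.com.  IN  CAA  0 issue "digicert.com"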

With everything set up, here are a few more differences between the two services that I noticed.

First of all, Azure does not provide IPv6, although since they use CNAME records, this can change at any point in the future. This is not a big deal for me, not only because IPv6 is still dreamland, but also because the redirection points to WordPress, which does not support IPv6 either. Nonetheless, it's an interesting point to make: despite Microsoft having spent years preparing for IPv6 support, and having even run Teredo tunnels, they too appear not to be ready to provide modern service entrypoints.

Second, and related: it looks like on Azure there's a DNAT in front of the requests sent to Gunicorn, as all the logs show the requests coming from 172.16.0.1 (a private IP address). This is the opposite of AppEngine, which shows the actual request IP in the log. It's not a huge deal, but it does make it a bit annoying to figure out whether someone is trying to attack your hostname. It also makes the missing IPv6 support funnier, given that the application itself apparently never needs to deal with the new addresses.
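
If Azure's front end passes the original client address along in an X-Forwarded-For header (an assumption on my part; I have not verified that it does), Werkzeug's ProxyFix middleware should be enough to surface the real address to Flask:

# Sketch only: assumes exactly one trusted proxy in front of the app,
# and that it sets X-Forwarded-For; adjust x_for if that is not the case.
from werkzeug.middleware.proxy_fix import ProxyFix

from application import app

# Rewrites request.remote_addr from X-Forwarded-For before Flask sees it.
app.wsgi_app = ProxyFix(app.wsgi_app, x_for=1)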

Speaking of logs: GCP exposes structured request logs, which is a pet peeve of mine that GCP at least makes easier to deal with. In general, structured logs allow you to filter much more easily for requests that terminated with an error status, which is something I paid close attention to in the weeks after deploying the original AppEngine redirector: I wanted to make sure my rewriting code didn't miss corner cases that users were actually hitting.
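
As a concrete example, a filter along these lines in GCP's log viewer (quoted from memory, so double-check the field names) pulls out just the requests that terminated with an error status:

resource.type="gae_app"
protoPayload.status>=400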

I couldn't figure out how to get a similar level of detail on Azure, but honestly I have not tried too hard yet, because I don't need that level of control for the moment. Also, while there does seem to be an entry in the portal's menu to query logs, when I try it out I get the message «Register resource provider 'Microsoft.Insights' for this subscription to enable this query», which suggests to me that it might be a paid extra.

Speaking of paid: the question of costs clearly needs to be kept in sight, particularly given recent news cycles. Azure seems to provide a 12-month free trial, but it also gives you £150 of credit for 14 days, which doesn't seem to add up to me. I'll update the blog post (or write a new one) with more details after I have some more experience with the system.

I know that someone will comment complaining that I shouldn't even consider Cloud Computing as a valid option. But honestly, from what I can see, I will likely be running a couple more Cloud applications out there, rather than keep hosting my own websites and running my own servers. It's just more practical, and it's a different trade-off between costs and time spent maintaining things, so I'm okay with it going this way. But I also want to make sure I don't end up locking myself into a single provider, with no chance of migrating.

Planets, Clouds, Python

Half a year ago, I wrote some thoughts about writing a cloud-native feed aggregator. I have actually sketched out some ideas since of how I would design it myself, and I even went through the (limited) trouble of having it approved for release. But I have not actually released any code; to be honest, I have not written any code either. The repository has been sitting idle.

Now, with Python 2's demise coming soon, and me not being interested in keeping a server around almost solely to run Planet Multimedia, I started looking into this again. The first thing I realized is that I want both to reuse as much existing code as I can, and to integrate with “modern” professional technologies such as OpenTelemetry, which I have come to appreciate at work, even if it sounds like overkill.

But that's where things get complicated: while going full “left-pad”, with a module for literally everything, is not something you'll find me happy about, a quick look at feedparser, probably the most common module for reading feeds in Python, shows just how much code is spent covering for old Python versions (before 2.7, even), or implementing minimal viable interfaces to avoid any mandatory dependencies.

Thankfully, as Samuel from NewsBlur pointed out, it's relatively trivial to fetch the feed with requests and then hand it to feedparser. And since there are OpenTelemetry integration points for requests, having an instrumented feed fetcher shouldn't be too hard. That's probably going to be my first focus when writing Tanuga, next weekend.
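
In practice, with today's packages, I expect the fetching side to look something like this sketch (the OpenTelemetry exporter setup is elided, and the feed URL is just an example):

#!/usr/bin/env python3

import feedparser
import requests
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# After this call, every call made through requests produces a span.
RequestsInstrumentor().instrument()


def fetch_feed(url):
    # Fetch with requests (instrumented), parse with feedparser, which
    # happily accepts the raw bytes of an already-fetched document.
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return feedparser.parse(response.content)


if __name__ == '__main__':
    feed = fetch_feed('https://example.com/feed.xml')  # illustrative URL
    print(feed.feed.get('title'), len(feed.entries))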

Speaking of NewsBlur, the chat with Samuel also made me realize how much of it is still tied to Python 2. Since I've gathered quite a bit of experience with porting to Python 3 at work, I'm trying to find some personal time to contribute smaller fixes to get it running on Python 3. The biggest hurdle I'm having right now is setting it up on a VM so that I can start it up in Python 2 to begin with.

Why am I back to looking at this pseudo-actively? Well, the main reason is that rawdog still uses Python 2, and that is going to be a major security pain next year. But it's also the last non-static website I run on my own infrastructure, and I would really love to get rid of it entirely. Once I do that, I can at least stop running my own (dedicated or virtual) servers. And that's going to save me time (and money, but time is the more important of the two here).

My hope is that once I find a good solution for migrating Planet Multimedia to the Cloud, I can move the remaining static websites to other solutions, likely Netlify as I did for my photography page. And after that, I can stop the last remaining server, and be done with sysadmin work outside of my flat. Because honestly, it's not worth my time to run all of these.

I can already hear a few folks complaining with the usual remark of “it's someone else's computer!”, but the answer is that yes, it's someone else's computer, but the computer of someone who's paid to do a good job with it. This is possibly the only way for me to carve out some time to work on more Open Source software.

“Planets” in the World of Cloud

As I have written recently, I'm trying to reduce the number of servers I directly manage, as it's getting annoying and, honestly, out of touch with what my peers are doing right now. I already hired another company to run the blog for me, although I keep access to all of its data at hand and can migrate away if needed. I also gave Firebase Hosting a try for my tiny photography page, to see whether it would be feasible to replace my homepage with it.

But one of the things I still definitely need a server for is to keep running Planet Multimedia, despite its tiny userbase and dwindling content (if you work in FLOSS multimedia and you want to be added to the Planet, drop me an email!).

Right now, the Planet is maintained through rawdog, a Python script that works locally with no database. This is great for running on a vserver, but in a world where most of the investment and improvement goes into Cloud services, it's not really viable as an option. And to be honest, the fact that this still uses Python 2 worries me more than a little, particularly when the author insists that Python 3 is a different language (it isn't).

So, I'm now in the market for replacing the Planet Multimedia backend with something “Cloud native”, that is, designed to run on some cloud, and possibly lightweight. I don't really want to start dealing with Kubernetes, running my own PostgreSQL instances, or setting up Apache. I would really like something that looks more like the redirector I blogged about before, or like the stuff I deal with for a living at work. Because it is 2019.

So, sketching this “on paper” very roughly, I expect such software to be along the lines of a single binary with a configuration file, which outputs static files to be served by the web server. Kind of like rawdog, but long-running. Changing the configuration would require restarting the binary, but that's acceptable. No database access is really needed, as caching can be maintained at the process level, although that would mean that permanent redirects couldn't be rewritten into the configuration. So maybe some configuration database would help; it seems most clouds support some simple unstructured data storage that would solve that particular problem.

From experience at work, I would expect the long-running binary to itself be a webapp, so that you can either inspect (read-only) what's going on, or make changes to the configuration database through it. And it should probably have independent, parallel fetchers for the various feeds, which store the received content into a shared (in-memory only) structure that the generation routine uses to produce the output files. It may sound like over-engineering the problem, but that's a bit of a given for me, nowadays.
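
To make that a bit more concrete, here is a very rough sketch of the fetch-and-generate core (everything in it, from the feed list to the output format, is made up for illustration, and the webapp and configuration-database parts are left out):

#!/usr/bin/env python3
"""Rough sketch of the long-running planet generator described above."""

import time
from concurrent.futures import ThreadPoolExecutor

import feedparser
import requests

# Hypothetical configuration; the real thing would read a config file
# or a configuration database.
FEEDS = [
    'https://example.com/multimedia/feed.xml',
]


def fetch(url):
    """Fetches a single feed; returns (url, parsed) or (url, None)."""
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        return url, feedparser.parse(response.content)
    except requests.RequestException:
        return url, None


def generate(cache, path='planet.html'):
    """Renders the shared in-memory cache into a static HTML page."""
    entries = []
    for parsed in cache.values():
        entries.extend(parsed.entries)
    entries.sort(key=lambda e: e.get('published_parsed') or time.gmtime(0),
                 reverse=True)
    with open(path, 'w') as output:
        for entry in entries:
            output.write('<h2><a href="%s">%s</a></h2>\n' % (
                entry.get('link', '#'), entry.get('title', '(untitled)')))


def main():
    cache = {}  # shared, in-memory only; no database access needed
    while True:
        with ThreadPoolExecutor(max_workers=4) as pool:
            for url, parsed in pool.map(fetch, FEEDS):
                if parsed is not None:
                    cache[url] = parsed
        generate(cache)
        time.sleep(30 * 60)  # refresh every half hour


if __name__ == '__main__':
    main()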

To be fair, the part that makes me most uneasy of all is authentication, but an Identity-Aware Proxy might be a good solution for this. I have not looked into it, but I used something similar at work.

I’m explicitly ignoring the serving-side problem: serving static files is a problem that has mostly been solved, and I think all cloud providers have some service that allows you to do that.

I'm not sure whether I will be able to work more on this, rather than just providing a sketched-out idea. If anyone knows of something like this already, or feels like giving building it a try, I'd be happy to help (employer permitting, of course). Otherwise, if I find some time to build stuff like this, I'll try to release it as open source, to build upon.

Blog Redirects & AppEngine

You may remember that when I announced the move to WordPress, I promised I wouldn't break any of the old links, particularly as I have kept them working since I started running the blog underneath my home office's desk, on a Gentoo/FreeBSD box, just shy of thirteen years ago.

This is not a particularly trivial matter, because Typo used at least three different permalink formats (and two different formats for linking to tags and categories), and Hugo used different ones for all of those too. In addition, one of the old Planet aggregators I used to be on had a long-standing bug that truncated URLs to a certain length (actually, two certain lengths, as they extended it at some point), and since those ended up indexed by a number of search engines, I ended up maintaining a long mapping between broken URLs and what they were meant to be.

And once I had such a mapping, I ended up also keeping in it the broken links that other people had created towards my blog. And then, when I fixed typos in titles and permalinks, I added all of those to the list as well. And then, …

Oh yeah, and there is the other thing: the original domain of the blog, which I made redirect to the newer one nearly ten years ago.

The end result is that I have been holding on, for nearly ten years, to an unwieldy mod_rewrite configuration for Apache, which also prevented me from migrating to any other web server. Migrating to a new hostname when I moved to WordPress was always my plan, if nothing else so as not to have to deal with all those rewrites in the same configuration as the webapp itself.

I had kept, until last week, the same abomination of a configuration, running on the same vserver the blog used to run on. But between ending my relationships with customers (six years ago, when I moved to Dublin), moving the blog out, and removing the website of a friend of mine who decided to run his own WordPress, the amount of work needed to maintain the vserver is no longer commensurate with the results.

While discussing my options with a few colleagues, one idea that came up was to just convert the whole thing into a simple Flask application, and run it somewhere. I ended up wanting to try my employer's own offerings, and ran it on AppEngine (but the app itself does not use any AppEngine-specific API; it's literally just a Flask app).

This meant having the URL mapping in Python, with a bit of regular-expression magic to make sure the URLs from previous blog engines are replaced with WordPress-compatible ones. It also meant that I can have explicit logic for what to re-process and what not to, which was not easily done with Apache (though still possible).
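
To give an idea of the shape of it, here is a minimal sketch (not the actual 700-line app; the patterns, hostname, and mappings are all made up):

#!/usr/bin/env python3
"""Minimal sketch of the redirector idea; illustrative only."""

import re

from flask import Flask, redirect

app = Flask(__name__)

# Regular-expression rewrites from old engines' permalink formats to
# WordPress-compatible ones (hypothetical examples).
_REWRITES = [
    (re.compile(r'^articles/(\d{4}/\d{2}/\d{2})/([^/]+)/?$'), r'\1/\2/'),
]

# Manually curated mapping of known-broken (e.g. truncated) URLs.
_FIXED_MAP = {
    'articles/2006/05/01/some-truncated-ti': '2006/05/01/some-truncated-title/',
}


@app.route('/<path:path>')
def old_link(path):
    target = _FIXED_MAP.get(path)
    if target is None:
        for pattern, replacement in _REWRITES:
            if pattern.match(path):
                target = pattern.sub(replacement, path)
                break
    if target is None:
        return 'Not Found', 404
    return redirect('https://blog.example.com/' + target, code=308)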

Using an actual programming language instead of Apache configuration also means that I can be a bit smarter in how I process the requests. In particular, before returning the redirect to the requester, I now verify whether the target exists (or rather, whether WordPress returns an OK status for it), and use that to decide whether to return a permanent or a temporary redirect. This means that most requests to the old URLs return permanent (308) redirects, while whatever is not found raises a warning I can inspect, to see whether I should add more entries to the maps.
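
The check itself can be as simple as the following sketch (the real code would want to cache the answers, rather than probe WordPress on every request):

import logging

import requests


def redirect_code(target_url):
    """Returns 308 only if WordPress actually serves the target."""
    try:
        response = requests.head(target_url, allow_redirects=True, timeout=5)
    except requests.RequestException:
        return 302  # be conservative if the check itself fails
    if response.ok:
        return 308
    logging.warning('No WordPress target for %s', target_url)
    return 302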

[Image: a very simple UML sequence diagram of the redirector, at a high level.]

The best part of all of this is, of course, that the AppEngine app is effectively always below the free-tier quota, and as such costs effectively nothing. And even if it weren't, the fact that it's a simple Flask application with no dependency on AppEngine itself means I can move it to any other hosting option I can afford.

The code is quite a mess right now, not generic and fairly loose. It has to work around an annoying Flask issue, and as such it's not in any state for me to open-source, yet. My plan is to do so as soon as possible, although it might not include the actual URL maps, for the sake of obscurity.

But what is very clear to me from all this is that if you want a domain whose only task is to redirect to other (static) addresses, such as projects hosted off-site, or affiliate links (two things that I have been doing on my primary domain together with the rest of the site, by the way), then AppEngine and Flask are actually a pretty good option. You can get it done in a few hours.

Backing up cloud data? Help request.

I'm very fond of backups, after the long series of issues I had before I started doing incremental backups. I still have some backup DVDs around, some of which are almost unreadable, and at least one that is compressed with xar, in a format that is no longer supported, especially on 64-bit systems.

Right now, my backups are all managed through rsnapshot, with a few custom scripts on top to make sure that if a host is not online, the previous backup is retained. This works almost perfectly, if you exclude the problems with restored files and the fact that a rename causes files to be duplicated, as rsnapshot does not really apply any data de-duplication (and fdupes and similar programs tend to be a bit too slow to use on 922GB of data).

But there is one problem that rsnapshot does not really solve: backup of cloud data!

Don't get me wrong: I do back up the (three) remote servers just fine, but this does not cover the data that is present in remote, “cloud” storage, such as the GitHub, Gitorious, and BitBucket repositories, or delicious bookmarks, GMail messages, and so on and so forth.

Cloning the bare repositories and backing those up is relatively trivial: it's a simple script to write. The problem starts with the less “programmatic” services, such as the aforementioned bookmarks and messages. Especially with GMail, as copying the whole 3GB of data from the server each time is unlikely to work out fine; it has to be done properly.
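
For the repositories, the script really does boil down to keeping mirror clones fresh, something along these lines (the repository list is illustrative; a real script would enumerate the repositories through each service's API):

#!/usr/bin/env python3
"""Sketch of the 'simple script' for the repository side of the backups."""

import subprocess
from pathlib import Path

REPOS = [  # hypothetical examples
    'git://github.com/example/project.git',
]

BACKUP_ROOT = Path('/var/backups/repos')

for url in REPOS:
    target = BACKUP_ROOT / url.rsplit('/', 1)[-1]
    if target.exists():
        # A bare mirror clone can be refreshed in place.
        subprocess.run(['git', '--git-dir', str(target), 'remote', 'update'],
                       check=True)
    else:
        subprocess.run(['git', 'clone', '--mirror', url, str(target)],
                       check=True)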

Does anybody have any pointers on the matter? Maybe there's already a smart backup script, similar to tante's smart pruning script, that can take care of copying the messages via IMAP, for instance…