Ten Years of IPv6, Maybe?

It seems like it’s now a tradition that, once a year, I will be ranting about IPv6. This usually happens because either I’m trying to do something involving IPv6 and I get stumped, or someone finds one of my old blog posts and complains about it. This time is different, to a point. Yes, I sometimes throw some of my older posts out there, and they receive criticism in the form of “it’s from 2015” – which people think is relevant, but isn’t, since nothing has really changed – but the occasion this year is celebrating the ten-year anniversary of World IPv6 Day, the so-called 24-hour test of IPv6 by the big players of network services (including, but not limited to, my current and past employers).

For those who weren’t around or aware of what was going on at the time, this was a one-time event in which a number of companies and sites organized to start publishing AAAA (IPv6) records for their main hostnames for a day. Previously, a number of test hostnames existed, such as ipv6.google.com, so if you wanted to work on IPv6 tech you could, but you had to go and look for it. The whole point of the single day test was to make sure that users wouldn’t notice if they started using the v6 version of the websites — though as history will tell us now, a day was definitely not enough to find that many of the issues around it.

For most of these companies it wasn’t until the following year, on 2012-06-06, that IPv6 “stayed on” on their main domains and hostnames, which should have given enough time to address whatever might have come out of the one day test. For a few, such as OVH, the test looked good enough to keep IPv6 deployed afterwards, and that gave a few of us a preview of the years to come.

I took part in the test day (as well as the launch) — at the time I was exploring options for getting IPv6 working in Italy through tunnels, and I tried a number of different options: Teredo, 6to4, and eventually Hurricane Electric. If you’ve been around enough in those circles you may be confused by the lack of SixXS as an option — I had encountered their BOFH side of things, and got my account closed for signing up with my Gmail address (that was before I started using Gmail For Business on my own domain). Even when I was told that signing up with my Gentoo address would have gotten me extra credits, I didn’t want to deal with that behaviour, so I just skipped the option.

So ten years on, what lessons did I learn about IPv6?

It’s A Full Stack World

I’ve had a number of encounters with self-defined Network Engineers, who think that IPv6 just needs to be supported at the network level. If your ISP supports IPv6, you’re good to go. This is just wrong, and shouldn’t even need to be debated, but here we are.

Not only does supporting IPv6 require using slightly different network primitives at times – after all, Gentoo has had an ipv6 USE flag for years – but you also need to make sure that anything that consumes IP addresses throughout your application knows how to deal with IPv6. For an example, take my old post about Telegram’s IPv6 failures.
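
To give a concrete sketch of what “slightly different network primitives” means in practice: a client that resolves names with getaddrinfo() and tries every returned address works unchanged on v4-only, v6-only and dual-stack networks, while one that hardcodes AF_INET does not. This is only an illustration (the standard library’s socket.create_connection() does essentially the same thing):

import socket

def connect(host, port):
    # getaddrinfo returns both A and AAAA results in the system's
    # preferred order; trying each in turn keeps the code family-agnostic.
    last_error = None
    for family, socktype, proto, _, address in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        sock = socket.socket(family, socktype, proto)
        try:
            sock.connect(address)
            return sock
        except OSError as error:
            sock.close()
            last_error = error
    raise last_error or OSError("getaddrinfo returned no usable address")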

As far as I know Telegram’s issue is solved, but it’s far from uncommon — after all it’s an obvious trick to feed legacy applications a fake IPv4 if you can’t adapt them quickly enough to IPv6. If they’re not actually using it to initiate a connection, but only using it for (short-term) session retrieval or logging, you can get away with this until you replace or lift the backend of a network application. Unfortunately that doesn’t work well when the address is shown back to the user — and the same is true when the IP needs to be logged for auditing or security purposes: you cannot map arbitrary IPv6 addresses into a 32-bit address space, so while you may be able to provide a temporary session identifier, you would need something mapping the session time and the 32-bit identifier together, to match the original source of the request.
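
Purely as an illustration of that workaround (the names and structure here are mine, not Telegram’s or anyone else’s), the “fake IPv4” shim tends to boil down to handing the legacy backend a synthetic 32-bit identifier while keeping a timestamped mapping back to the real v6 source:

import ipaddress
import itertools
import time

class LegacyAddressMap:
    """Hypothetical shim for a backend that can only store 32 bits per address."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._sessions = {}  # (timestamp, synthetic id) -> real IPv6 address

    def fake_ipv4_for(self, client_address):
        address = ipaddress.ip_address(client_address)
        if address.version == 4:
            return client_address  # real IPv4 passes through untouched
        synthetic = next(self._counter) & 0xFFFFFFFF
        self._sessions[(time.time(), synthetic)] = address
        # The legacy side sees a well-formed, but meaningless, IPv4.
        return str(ipaddress.IPv4Address(synthetic))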

Another example of where the difference between IPv4 and IPv6 might cause hard-to-spot issues is anonymisation. Now, I’m not a privacy engineer and I won’t suggest that I’ve got a lot of experience in the field, but I have seen attempts at “anonymising” user IPv4s by storing (or showing) only the first three octets. Besides the fact that this doesn’t work if you are trying to match up people within a small pool (getting to the ISP would be plenty enough in some of those cases), it does not work with IPv6 at all — you can have 120 of the 128 bits and still pretty much be able to identify a single individual.
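
For the sake of illustration, this is what that kind of truncation usually looks like, as a sketch with Python’s ipaddress module (the mask lengths are arbitrary): dropping the last octet of an IPv4 leaves 24 bits, while even a /48 of an IPv6 still points at a single subscriber in most residential deployments.

import ipaddress

def truncate_for_logging(address):
    # Zero out the host bits: /24 for IPv4, /48 for IPv6 (both arbitrary).
    parsed = ipaddress.ip_address(address)
    prefix = 24 if parsed.version == 4 else 48
    network = ipaddress.ip_network(f"{address}/{prefix}", strict=False)
    return str(network.network_address)

# truncate_for_logging("192.0.2.42")       -> "192.0.2.0"
# truncate_for_logging("2001:db8:1:2::42") -> "2001:db8:1::"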

You’re Only As Prepared As Your Dependencies

This is pretty much a truism in software engineering in general, but it might surprise people that this applies to IPv6 even outside of the “dependencies” you see as part of your code. Many network applications are frontends or backends to something else, and in today’s world, with most things being web applications, this is true for cross-company services too.

When you’re providing a service to one user, but rely on a third party to provide you a service related to that user, it may very well be the case that IPv6 will get in your way. Don’t believe me? Go back and read my story about OVH. What happened there is actually a lot more common than you would think: whether it is payment processors, advertisers, analytics, or other third party backends, it’s not uncommon to assume that you can match the session by source address and time (although that is always very sketchy, as you may be using Tor, or any other setup where requests to different hosts are routed differently).

Things get even more complicated as time goes by. Let’s take another look at the example of OVH (knowing full well that it was 10 years ago): the problem there was not that the processor didn’t support IPv6 – though it didn’t – the problem was that the communication between OVH (v6) and the payment processor (v4) broke down. It’s perfectly reasonable for the payment processor to request information, through a back-channel, about the customer that the vendor is sending through: if the vendor is convinced they’re serving a user in Canada, but the processor is receiving a credit card from Czechia, something smells fishy — and payments are all about risk management after all.

Breaking when using Tor is just as likely, but that can also be perceived as a feature, from the point of view of risk. But when the payment processor cannot understand what the vendor is talking about – because the vendor was talking to you over v6, and passed that information to a processor expecting v4 – you just get a headache, not risk management.

How did this become even more complicated? Well, at least in Europe a lot of payment processors had to implement additional verification through systems such as 3DSecure, Verified By Visa, and whatever Amex calls it. It’s often referred to as Strong Customer Authentication (SCA), and it’s a requirement of the 2nd Payment Service Directive (PSD2), but it has existed for a long while and I remember using it back before World IPv6 Day as well.

With SCA-based systems, a payment processor has pretty much no control over what their full set of dependencies is: each bank provides their own SCA backend, and to the best of my understanding (with the full disclosure that I never worked on payment processing systems), they all just talk to Visa and MasterCard, who then have a registry of which bank’s system to hit to provide further authentication — different banks do this differently, with some risk management engine behind it that either approves straight away, or challenges the customer somehow. American Express, as you can imagine, simplifies their own life by being both the network and the issuer.

The Cloud is Vastly V4

This is probably the one place where I’m just as confused as some of the IPv6 enthusiasts. Why do neither AWS nor Google Cloud provide IPv6 as an option for virtual machines, to the best of my knowledge?

If you use “cloud native” solutions, at least on Google Cloud, you do get IPv6, so there’s that. And honestly, if you’re going all the way to the cloud, it’s good design to leverage the provided architecture. But there are plenty of cases in which you can’t use, say, AppEngine to provide a non-HTTP transport, and having IPv6 available would increase the usability of the protocol.

Now this is interesting because other providers go different ways. Scaleway does indeed provide IPv6 by default (though, not in the best of ways in my experience). It’s actually cheaper to run on IPv6 only — and I guess that if you do use a CDN, you could ask them to provide you a dual-stack frontend while talking to them with an IPv6-only backend, which is very similar to some of the backend networks I have designed in the past, where containers (well before Docker) didn’t actually have IPv4 connectivity out, and they relied on a proxy to provide them with connections to the wide world.

Speaking of CDNs – which are probably not often considered part of Cloud Computing, but I will bring them in anyway – I have mused before that it’s funny how a number of websites that use Akamai and other CDNs appear not to support IPv6, despite the fact that the CDNs themselves do provide IPv6-frontend services. I don’t know for sure that this isn’t related to something “silly” such as pricing, but there are certainly more concerns to supporting IPv6 than just “flipping a switch” in the CDN configuration: as I wrote above, there are definitely full-stack concerns with receiving inbound connections via IPv6 — even if the service does not need full auditing logs of who’s doing what.

Privacy, Security, and IPv6

If I were to say “IPv6 is a security nightmare”, I’d probably get a divided room — I think there’s a lot of nuance needed to discuss privacy and security around IPv6.

Privacy

First of all, it’s obvious that IPv6 was designed and thought out at a different time than the present, and as such it brought with it some design choices that look wrong or even laughable nowadays. I don’t laugh at them, but I do point out that they were indeed made with a different idea of the world in mind, one that I don’t think is reasonable to keep pining for.

The idea that you can tie an end-user IPv6 address to the physical (MAC) address of an adapter is not something that you would come up with in 2021 — and indeed, IPv6 was retrofitted with at least two proposals for “privacy-preserving” address generation options. After all, the very idea of “fixed” MAC addresses appears to be on the way out — mobile devices started using random MAC addresses tied to specific WiFi networks, to reduce the likelihood of people being tracked between different networks (and thus different locations).

Given that IPv6 is being corrected, you may then expect that the privacy issue is now closed, but I don’t really believe that. The first problem is that, from the network administrator’s point of view, there’s no real way to enforce what the equipment inside the network will do. Let me try to build a strawman for this — but one that I think is fairly reasonable as a threat model.

While not every small-run manufacturer will go out of their way to be assigned an OUI to give their devices a “branded” MAC address – many of the big ones don’t even do that and leave the MAC provided by the chipset vendor – there are a few who do. I know that we did that at one of my previous customers, where we decided not only to get an OUI to use for setting the MAC addresses of our devices — but we also used it as the serial number of the device itself. And I know we’re not alone.

If some small-run IoT device is shipped with a manufacturer-provided MAC address using their own OUI, it’s likely that the addresses themselves are predictable. They may not quite be sequential, and they probably won’t start from 00:00:01 (they didn’t in our case), but it might well be possible to figure out at least a partial set of addresses that the devices might use.

At that point, if these devices don’t use a privacy-preserving ephemeral IPv6, it shouldn’t be too hard to “scan” a network for them, by calculating the effective IPv6 on the same /64 network as a user request. This is simplified by the fact that, most of the time, ICMP6 is allowed through firewalls — because some of it is needed for operating IPv6 altogether, and way too often even I have left stuff more open than I would have liked to. A smart gateway would be able to notice this kind of scan, but… I’m still not sure how most routers handle things like this. (As it turns out, the default UniFi setup at least seems to configure this correctly.)
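
To make the strawman concrete: the classic SLAAC scheme derives the interface identifier from the MAC address with “modified EUI-64” (flip the universal/local bit of the first octet, splice ff:fe into the middle), so if the OUI and the rough serial range are known, the candidate addresses on a given /64 are trivially enumerable. A sketch, with a made-up OUI and serial range:

import ipaddress

def slaac_address(prefix, mac):
    # Modified EUI-64: flip bit 1 of the first octet, insert ff:fe.
    octets = [int(part, 16) for part in mac.split(":")]
    octets[0] ^= 0x02
    interface_id = bytes(octets[:3] + [0xFF, 0xFE] + octets[3:])
    return ipaddress.IPv6Network(prefix)[int.from_bytes(interface_id, "big")]

# Enumerate the addresses a small production run might end up with
# (both the OUI and the serial numbers are invented for the example).
candidates = [
    slaac_address("2001:db8:1234:abcd::/64", f"ac:de:48:00:11:{serial:02x}")
    for serial in range(256)
]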

There’s another issue — even with privacy extensions, IPv6 addresses are often provided by ISPs in the form of /64 networks. These networks are usually static per subscriber, as if they were static public IPv4 addresses, which again is something a number of geeks would like to have… but it also has the side effect of making it possible to track a whole household in many cases.

This is possibly controversial with some folks, because the move from static addresses to dynamic dialup addresses marked the advent of what some annoying greybeards refer to as Eternal September (if you use that term in the comments, be ready to be moderated away, by the way). But with dynamic addresses came some level of privacy. Of course the ISP could always figure out who was using a certain IP address at a given time, but websites wouldn’t be able to keep users tracked day after day.

Note that dynamic addresses were not meant to address the need for privacy; that was just incidental. And the fact that you would change addresses often enough that websites couldn’t track you was also not by design — it was just expensive to stay connected for long periods of time, and even on flat rates you may have needed the phone line to make calls, or you may just have lost the connection because of… something. The game (eventually) changed with DSL (and cable and other similar systems), which didn’t hold the phone line busy and was much more stable, and eventually the use of always-on routers instead of “modems” connected to a single PC made cycling to a new address a rare occurrence.

Funnily enough, we once again gained some tidbit of privacy in this space with the advent of carrier-grade NAT (CGNAT), which was once again not designed for this at all. But since it concentrated multiple subscribers (sometimes entire neighbourhoods or towns!) into a single outbound IP address, it made it harder to tell which person (or even which household) was accessing a certain website — unless you are the ISP, clearly. This, by the way, is pretty much the same principle that certain VPN providers use nowadays to sell their product as a privacy feature; don’t believe them.

This is not really something that the Internet was designed for — protection against tracking didn’t appear to be one of the worries of an academic network that assumed each workstation would be directly accessible to others, and shared among different users. The world we live in, with user devices that are increasingly single-tenant, and that are not meant to be accessible by anyone else on the network, is different from what the original design of this network envisioned. And IPv6 carried on with that design to begin with.

Security

Now on the other hand there’s the problem of actual endpoint security. One of the big issues with network protocols is that firewalls are often designed with one, and only one, protocol in mind. Instead of a strawman, this time let me talk about an episode from my past.

Back when I was in high school, our systems lab was the only laboratory that had a mixed IP (v4, obviously — it was over 16 years ago!) and IPX network, since one of the topics they were meant to teach us was how to set up a NetWare network (it was a technical high school). All of the computers were set up with Windows 95, with what little security features were available, and included a software firewall (I think it was ZoneAlarm, but my memory is fuzzy on this point). While I’m sure that trying to disable it or work around it would have worked just as well, most of us decided to not even try: Unreal Tournament and ZSNES worked over IPX just as well, and ZoneAlarm had no idea of what we were doing.

Now, you may want to point out that obviously you should make sure to secure your systems for both IP generations anyway. And you’d be generally right. But given that systems have sometimes been lifted and shifted many times, it might very well be that there’s no certainty that a legacy system (OS image, configuration, practice, network, whatever it is) can be safely deployed in a v6 world. If you’re forced to, you’ll probably invest money and time to make sure that it is the case — if you don’t have an absolute need besides “it’s the Right Thing To Do”, you will most likely try to live as much as you can without it.

This is why I’m not surprised to hear that for many sysadmins out there, disabling IPv6 is part of the standard operating procedure of setting up a system, whether it is a workstation or a server. This is not helped by the fact that on Linux it’s way too easy to forget that ip6tables is different from iptables (and yes, I know that this is hopefully changing soon).

Software and Hardware Support

This is probably the aspect of the IPv6 fan base where I feel most at home. For operating systems and hardware (particularly network hardware) not to support IPv6 in 2021 feels like we’re being cheated. IPv6 is a great tool for backend networks, as you can quickly drop a lot of legacy features of IPv4, and use cascading delegation of prefixes in place of NAT (I’ve done this, multiple times) — so not supporting it at the very basic level is damaging.

Microsoft used to push for IPv6 with a lot of force: among other things, they included Miredo/Teredo in Windows XP (add that to the list of reasons why some professionals still look at IPv6 suspiciously). Unfortunately WSL2 (and as far as I understand HyperV) do not allow using IPv6 for the guests on a Windows 10 workstation. This has gotten in my way a couple of times, because I am otherwise used to just jump around my network with IPv6 addresses (at least before Tailscale).

Similarly, while UniFi works… acceptably well with IPv6, it still treats it as an afterthought, and they are not exactly your average home broadband router maker either. When even semi-professional network equipment manufacturers can’t give you a good experience out of the box, you do need to start asking yourself some questions.

Indeed, if I could have a v6-only network with NAT64, I might do that. I still believe such a setup is useless and unrealistic for most people, but since I actually do develop software in my spare time, I would like to have a way to test it. It’s the same reason why I own a number of IDN domains. But despite having a lot of options for 802.1x, VLAN segregation, and custom guest hotspots, there’s no trace of NAT64 or other goodies like that.

Indeed, the management software pretty much only shows you IPv4 addresses for most things, and you need to dig deep to find the correct settings to even allow IPv6 on a network and set it up correctly. Part of the reason is likely that clients have a lot more weight when it comes to address selection than in v4: while DHCPv6 is a thing, it’s not well supported (still not supported at all on WiFi on Android as far as I know — all thanks to another IPv6 “purist”), and the router advertisement and neighbor discovery protocols allow for passive autoconfiguration that, on paper, is so much nicer than the “central authority” of DHCP — but makes it harder to administer “centrally”.

iOS, and Apple products in general, appear to be fond of IPv6. More than Android, for sure. But most of my IoT devices are still unable to work on an IPv6-only network. Even ESPHome, which is otherwise an astounding piece of work, does not appear to provide IPv6 endpoints — and I don’t know how much of that is because the hardware acceleration is limited to v4 structures, and how much of it is just because it would consume more memory on such a small embedded device. The same goes for CircuitPython when using the AirLift FeatherWing.

The folks who gave us the Internet of Things name sold the idea of every device in the world being connected to the Internet through a unique IPv6 address. This is now a nightmare for many security professionals, a wet dream for certain geeks, but most of all an unrealistic situation that I don’t expect will become reality in my lifetime.

Big Names, Small Sites

As I said at the beginning, explaining some of the thinking behind World IPv6 Day and World IPv6 Launch, a number of big names, including Facebook and Google, have put their weight behind IPv6 from early on. Indeed, Google keeps statistics of IPv6 usage with a per-country split. Obviously these companies, as well as most of the CDNs, and a number of other big players such as Apple and Netflix, have had the time, budget, and engineers to be able to deploy IPv6 far and wide.

But as I have ventured before, I don’t think they are enough to make a compelling argument for IPv6-only networks. Even when the adoption of IPv6 in addition to IPv4 might make things more convenient for ISPs, the likelihood of being able to drop IPv4 compatibility tout court is approximately zero, because the sites people actually need are not going to be available over v6 any time soon.

I’m aware of trackers (for example this one, but I remember seeing more of them) that tend to track IPv6 deployment for “Alexa Top 500” (and similar league table) websites. But most of the services that average people care about don’t usually seem to be covered by these.

The argument I made in the linked post boils down to this: the day-to-day of an average person is split between a handful of big-name websites (YouTube, Facebook, Twitter), and a plethora of websites that are anything but global. Irish household providers are never going to make any of the Alexa league tables — and the same is likely true for most other countries that are not China, India, or the United States.

Website league tables don’t usually track national services such as ISPs, energy suppliers, mobile providers, banks and other financial institutions, grocery stores, and transport companies. There are other lists that may be more representative here, such as Nielsen Website Ratings, but even those are targeted at selling ad space — and suppliers and banks are not usually interested in that at all.

So instead, I’ve built my own. It’s small, and it mostly only covers the countries I have experienced directly; it’s IPv6 in Real Life. I’ve tried listing a number of services I’m aware of, and it should give a better idea of why I think the average person is still not using IPv6 at all, except for the big names listed above.

There’s another problem with measuring this when resolving hosts (or even connecting to them — which I’m not actually doing in my case). While this easily covers the “storefront” of each service, many use separate additional hosts for accessing logged-in information, such as account data. I’m covering this by providing a list of “additional hosts” for each main service. But while I can notice where the browser is redirected, I would have to go through the whole network traffic to find all the indirect hosts that each site connects to.

Most services, including the big tech companies, often have separate hosts that they use to process login requests and similar high-stakes forms, rather than using their main domain. Or they may use a different domain for serving static content, maybe even from a CDN. It’s part and parcel of the fact that, for the longest time, we considered hostnames to be a security perimeter. It’s also a side effect of wanting to make it easier to run multiple applications written in widely different technologies — one of my past customers did exactly this using two TLDs: the marketing pages were on a dot-com domain, while the login to the actual application was on the dot-net one.

Because of this “duality”, and the fact that I’m not really a customer of most of the services I’m tracking, I decided to just look at the “main” domain for them. I guess I could try to aim higher and collect a number of “service domains”, but that would be a point of diminishing returns. I’m going to assume that if the main website (which is likely simpler, or at least with fewer dependencies) does not support IPv6, their service domains don’t, either.
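
For what it’s worth, the check itself is the easy part: a name either publishes AAAA records or it doesn’t. Something along these lines (a sketch; the real list keeps more metadata than a boolean) is all that is needed per domain:

import socket

def advertises_ipv6(hostname):
    # True if the name resolves to at least one AAAA record. This says
    # nothing about whether the service actually works over IPv6.
    try:
        return bool(socket.getaddrinfo(hostname, 443, socket.AF_INET6))
    except socket.gaierror:
        return False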

You may have noticed that in some cases, smaller companies and groups appear to have better IPv6 deployments. This is not surprising: not only can you audit smaller codebases much faster than the extensive web of dependencies of big companies’ applications — but the reality of many small businesses is also that the system and network administrators get a bit more time to learn and apply skills, rather than having to work through a stream of tickets from everyone in the organization who is trying to deploy something, or has a flaky VPN connection.

It also makes it easier for smaller groups to philosophically think of “what’s the Right Thing To Do”, versus the medium-to-big company reality of “what does the business get out of spending time and energy on deploying this?” To be fair, it looks like Apple’s IPv6 requirements might have pushed some of the right buttons for that — except for the part where they don’t really require the services used by the app to be available over IPv6; it’s acceptable for the app to connect through NAT64 and similar gateways.

Conclusions

I know people paint me as a naysayer — and sometimes I do feel like one. I fear that IPv6 is not going to become the norm during my lifetime, and definitely not during my career. It is the norm to me, because working for big companies you do end up working with IPv6 anyway. But most users won’t have to care about it for a long time yet.

What I want to point out to the IPv6 enthusiasts out there is that the road to adoption is harsh, and it won’t get better any time soon. Unless some killer application of IPv6 comes out, where supporting v4 is no longer an option, most smaller players won’t bother. It’s a cost to them, not an advantage.

The performance concerns of YouTube, Netflix, or Facebook will not apply to your average European bank. The annoyance of going through CGNAT that Tailscale experiences is not going to be a problem for your smart lightbulb manufacturer who just uses MQTT.

Just saying “It’s the Right Thing To Do” is not going to make it happen. While I applaud those who are actually taking the time to build IPv6-compatible software and hardware, and I think that we actually need more of them taking the pragmatic view of “if not now, when?”, this is going to be a cost. And one that for the most part is not going to benefit the average person in the medium term.

I would like to be wrong, honestly. I wish that next year I’ll magically get firmware updates for everything I have at home and it will all work with IPv6 — but I don’t think I will. And I don’t think I’ll replace everything I own just because we ran out of address space in IPv4. It would be a horrible waste, to begin with, in the literal sense. The last thing we want is to tell people to throw away anything that does not speak IPv6, as it would just pile up as e-waste.

Instead I wish that more IPv6 enthusiasts would get to carry the torch of IPv6 while understanding that we’ll live with IPv4 for probably the rest of our lives.

The bakery is just someone else’s oven

Most of the readers of this blog are likely aware of the phrase “The Cloud is someone else’s computer” — sometimes with a “just” added to make it even more judgemental. Well, it should surprise nobody (for those who know me) that I’m not a particular fan of either the phrase, or the sentiment within it.

I think that the current Covid-19 quarantine is actually providing a good alternative take on it, which is the title of this post: “The bakery is just someone else’s oven.”

A lot of people during this pandemic-caused lockdown decided that it’s a good time to make bread, whether they had tried it before or not. And many had to deal with similar struggles, from being unable to find ingredients, to making mistakes while reading a recipe that led to terrible results, to following recipes that just give wrong instructions because they assume something that is not spelled out explicitly (my favourite being the recipes that assume you have a static oven — turns out that the oven we have at home only has “fan assisted” and “grill” modes).

Are we all coming out of the current crisis and deciding that home baking is the only true solution, and that using a bakery is fundamentally immoral? I don’t think so. I’m sure that there will be some extremists thinking that — there always are — but for the most part, reasonable people will accept that baking bread is not easy. While freshly baked bread can taste awesome, for most people making time to do it every day would cut heavily into time that is needed for other things, which may or may not be more important (work, childcare, wellness, …). So once the option of just buying good (or at least acceptable) bread from someone else becomes practical again, lots of us will go back to buying it most of the time, and just occasionally baking.

I think this matches very well the way Cloud and self-hosted solutions relate to each other. You can set up your own infrastructure to host websites, mail servers, containers, Docker, apps and whatever else. But most of the time this detracts from the time you would spend on something else, or you may need resources that are not straightforward to procure, or you may make mistakes configuring them, or you may just not know what the right tools are. While some (most) of these points apply to using Cloud solutions from various providers too, they are also of a different level — the same way you may have problems following a recipe that includes bread as an ingredient, rather than one for bread.

So for instance, if you’re a small business, you may be running your own mail server. It’s very possible that you already have a system administrator, or retain an “MSP” (Managed Service Provider), which is effectively a sysadmin-for-hire. But would it make sense for you to do that? You would have to have a server (virtual or physical) to host it, and you would have to manage the security, connectivity, spam tracking, logging, … it’s a lot of work! If you’re not in the business of selling email servers, it’s not likely that you’d ever get a good return on that resource investment. Or you could just pay FastMail to run it for you (see my post on why I chose them at the end).

Of course saying something like this will get people to comment on all kinds of corner cases, or risks connected with it. So let me be clear that I’m not suggesting this is the only solution. There are cases in which, even though it is not your primary business, running your own email server has significant advantages, including security. There are threat models and threat models. I think this once again matches the comparison with bakeries — there are people who can’t buy bread, because the risk of whichever bread they buy having been contaminated with something they are allergic to is not worth it.

If you think about the resource requirements, there’s another similar situation: getting all the ingredients in normal times is very easy, so why bother buying bread? Well, it turns out that the way supply works is not obvious, and definitely not linear. At least in the UK, there was an acknowledgement that “only around 4% of UK flour is sold through shops and supermarket”, and that the problem is not just the increase in demand, but also the significant change in usage patterns.

So while it would be “easy” and “cheap” to host your app on your own servers in normal times, there are situations in which this is either not possible, or significantly inconvenient, or expensive. What happens when your website is flooded by requests, and your mail server sharing the same bandwidth is unreachable? What happens when a common network zero-day is being exploited, and you need to defend against it as quickly as reasonably possible?

Pricing is another aspect that appears to match fairly well between the two concepts: I have heard so many complaints about Cloud being so expensive that it will bankrupt you — and they are not entirely wrong. It’s very easy to think of your company as so big as to need huge, complex solutions that will cost a lot more than a perfectly fine simpler one, either on Cloud or self-hosted. But the price difference shouldn’t be accounted for solely by comparing the price paid for the hosting versus the price paid for the Cloud products — you need to account for a lot more costs, related to management and resources.

Managing servers takes time and energy, and it increases risks, which is why a lot of business managers tend to prefer outsourcing that to a cloud provider. It’s pretty much always the case, anyway, that some of the services are outsourced. For instance, even the people insisting on self-hosting don’t usually go as far as “go and rent your own cabinet at a co-location facility!” — which would still be a step short of running your own fiber optic straight from the Internet exchange, and… well, you get my point. In the same spirit, the price of bread does not only include the price of the ingredients but also the time that is spent to prepare it, knead it, bake it, and so on. And similarly, even the people who I have heard arguing that baking every day is better than buying store bread don’t seem to suggest you should go and mill your own flour.

What about the recipes? Well, I’m sure you can find a lot of recipes in old cookbooks with ingredients that are nowadays understood as unhealthy — and not in the “healthy cooking” kind of way only, either. The same way as you can definitely find old tutorials that provide bad advice, while a professional baker would know how to do this correctly.

At the end of the day, I’m sure I’m not going to change the mind of anyone, but I just wanted to have something to point people at the next time they try the “no cloud” talk on me. And a catchy phrase to answer them by.

The importance of reliability

For the past seven years I have worked at Google as a Site Reliability Engineer. I’m actually leaving the company — I’m writing this during my notice period. I’m currently scheduled to join Facebook as a Production Engineer at the end of May (unless COVID-19 makes things even more complicated). Both roles are related to the reliability of services – Google even put it in the name – so you could say I have more than a passing idea of what is involved in maintaining (if not writing) reliable software and services.

I had to learn to be an SRE — this was my first 9-5 job (well, not really 9-5 given that SREs tend to have very flexible and very strange hours), and I hadn’t worked on such a scale before this. And as I wrote before, the job got me used to expect certain things. But it also made me realise how important it is for services and systems to be reliable, as well as secure. And just how much this is not fairly distributed out there.

During my tenure at Google, I’ve been oncall for many different services. Pretty much all of them have been business critical in one way or another — some much more than others. But none of them were critical to society: I’ve never joined the Google Cloud teams, any of the communication teams, or Maps teams. I had been in the Search team, but while it’s definitely important to the business, I think society would rather stay without search results than without a way to contact their loved ones. But that’s just my personal opinion.

The current huge jump in WFH users due to COVID-19 concerns has clearly shown how much more critical to society some online services are — services that even ten years ago wouldn’t have been considered as important: Hangouts, Meet, Zoom, Messenger, WhatsApp, and the list goes on. Video calls appear to be the only way to get in touch with our loved ones right now, as well as, for many, the only way to work. Thankfully, most of these services are provided by companies that are big enough to be able to afford reliability in one form or another.

But at least in the UK, this has shown how many other services are clearly critical for society, yet not provided by companies who can afford reliability. Online grocery shopping became the thing to do, nearly overnight. Ocado, possibly the biggest grocery delivery company, had so much pressure on their systems that they had to scramble, first introducing a “virtual queue”, and then eventually taking down the whole website. As I type this, their website has a front page that informs you that login is only available to those who already have a slot booked for this weekend, and is otherwise not available to anyone — no new slots can be booked.

In similar fashion, online retailers, GP surgery online systems, online prescription services, and banks also appeared to be smothered in requests. I would be surprised if libraries, bookstores, and the websites of restaurants that don’t rely on the big delivery companies weren’t also affected.

And that has made me sad, and at least in part made me feel ashamed of myself. You see, while I was looking for a new job I interviewed at another place. Not a big multinational company — a smaller one, a utility. And while the offer was very appealing, it was also a more challenging role, and I decided to pass on it. I’m not saying that I’d have made a bigger difference for them than any other “trained” SRE, but I do think that a lot of these “smaller” players need their fair dose of reliability.

The problem is that there’s a mixture of different attitudes, and actual costs, related to reliability the way Google and the other “bigs” do it. In the case of Google, more often than not the answer to something not working very well is to throw more resources (CPU, memory, storage) at it. That’s not something that you can do quickly when your service is running “on premise” (that is, in your own datacenter cabinet), and not something that you can do cheaply when you run on someone else’s cloud solution.

The thing is, Cloud is not just someone else’s computer. It’s a lot of computers, and it does add a lot of flexibility. And it can even be cheaper than running your own server, sometimes. But it’s also a risk, because if you don’t know when to say “enough”, you end up with budget-wrecking bills. Or sometimes with a problem “downstream”. Take Ocado — the likelihood is that it’s not the website that was being overloaded. It was the fulfillment. Indeed, the virtual queue approach was awesome: it limited the whole human interaction, not just the browser requests. And indeed, the queue worked fine (unlike, say, the CCC ticket queue), and the website didn’t look overloaded at all.

But saying that on-premise equipment does not scale is not trying to market cloud solutions — it’s admitting the truth: if you start getting that many requests at short notice, you can’t go buy, image, and set up another four or five machines to serve them — but you can tell Google Cloud, Amazon, or Azure to go and triple the amount of resources available. And that might or might not make it better for you.

It’s a tradeoff. And not one I have answers for. I can’t say I have experience with managing this tradeoff, either — all the teams I worked on had nearly blank cheques for internal resources (not quite, but nearly), and while resource saving was and is a thing, it never gets to be a real dollar amount that, as an SRE, you end up dealing with. Other companies, particularly smaller ones, need to pay a lot of attention to that.

From my point of view, what I can do is try to be more open with discussing design decisions in my software, particularly when I think it’s my experience talking. I still need to work actively on Tanuga, and I am even considering making a YouTube video of me discussing the way I plan to implement it — as if I was discussing this during a whiteboard design interview (since I have quite a bit of experience with them, this year).

Blog Redirects, Azure Style

Last year, I set up an AppEngine app to redirect the old blog’s URLs to the WordPress install. It’s a relatively simple Flask web application, although it turned out to be around 700 lines of code (quite a bit just to serve redirects). While it ran fine for over a year on Google Cloud without me touching anything, and fit into the free tier, I had to move it as part of my divestment from GSuite (which is only vaguely linked to me leaving Google).

I could have just migrated the app on a new consumer account for AppEngine, but I decided to try something different, to avoid the bubble, and to compare other offerings. I decided to try Azure, which is Microsoft’s cloud offering. The first impressions were mixed.

The good thing about the Flask app I used for redirection being that simple is that nothing ties it to any one provider: the only things you need are a Python environment, and the ability to install the requests module. For the same codebase to work on AppEngine and Azure, though, there seems to be a need for one simple change. Both providers appear to rely on Gunicorn, but AppEngine appears to look for an object called app in the main module, while Azure looks for it in the application module. This is trivially solved by defining the whole Flask app inside application.py and having the following content in main.py (the command line support is for my own convenience):

#!/usr/bin/env python3

import argparse

from application import app


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--listen_host', action='store', type=str, default='localhost',
        help='Host to listen on.')
    parser.add_argument(
        '--port', action='store', type=int, default=8080,
        help='Port to listen on.')

    args = parser.parse_args()

    app.run(host=args.listen_host, port=args.port, debug=True)

The next problem I encountered was with the deployment. While there’s plenty of guides out there to use different builders to set up the deployment on Azure, I was lazy and went straight for the most clicky one, which used GitHub Actions to deploy from a (private) GitHub repository straight into Azure, without having to install any command line tools (sweet!) Unfortunately, I hit a snag in the form of what I think is a bug in the Azure GitHub Action template.

You see, the generated workflow for the deployment to Azure is pretty much zipping up the content of the repository, after creating a virtualenv directory to install the requirements defined for it. But while the workflow creates the virtualenv in a directory called env, the default startup script for Azure is looking for it in a directory called antenv. So for me it was failing to start until I changed the workflow to use the latter:

    - name: Install Python dependencies
      run: |
        python3 -m venv antenv
        source antenv/bin/activate
        pip install -r requirements.txt
    - name: Zip the application files
      run: zip -r myapp.zip .

Once that problem was solved, the next issue was to figure out how to set up the app on its original domain and have it serve TLS connections as well. This turned out to be a bit more complicated than expected, because I had set up CAA records in my DNS configuration to only allow Let’s Encrypt, but Microsoft uses DigiCert to provide the (short-lived) certificates, so until I removed that record it wouldn’t be able to issue a certificate (oops).

After everything was set up, here are a few more of the differences between the two services that I noticed.

First of all, Azure does not provide IPv6, although since they use CNAME records this can change at any time in the future. This is not a big deal for me, not only because IPv6 is still dreamland, but also because the redirection points to WordPress, which does not support IPv6. Nonetheless, it’s an interesting point to make that, despite Microsoft having spent years preparing for IPv6 support, and having even run Teredo tunnels, they also appear not to be ready to provide modern service entrypoints.

Second, and related, it looks like on Azure there’s a DNAT in front of the requests sent to Gunicorn — all the logs show the requests coming from 172.16.0.1 (a private IP address). This is the opposite of AppEngine, which shows the actual request IP in the log. It’s not a huge deal, but it does make it a bit annoying to figure out if there’s someone trying to attack your hostname. It also makes the lack of IPv6 support funnier, given that the setup does not appear to require the application itself to support the new addresses.

Speaking of logs, GCP exposes structured request logs. This is a pet peeve of mine, which GCP appears to at least make easier to deal with. In general, it allows you to filter logs much more easily to find out instances of requests being terminated with an error status, which is something that I paid close attention to in the weeks after deploying the original AppEngine redirector: I wanted to make sure my rewriting code didn’t miss some corner cases that users were actually hitting.

I couldn’t figure out how to get a similar level of detail in Azure, but honestly I have not tried too hard right now, because I don’t need that level of control for the moment. Also, while there does seem to be an entry in the portal’s menu to query logs, when I try it out I get a message «Register resource provider ‘Microsoft.Insights’ for this subscription to enable this query» which suggests to me it might be a paid extra.

Speaking of paid, the question of costs is something that clearly needs to be kept in clear sight, particularly given recent news cycles. Azure seems to provide a 12-month free trial, but it also gives you £150 of credit for 14 days, which doesn’t seem to match up properly to me. I’ll update the blog post (or write a new one) with more details after I have some more experience with the system.

I know that someone will comment complaining that I shouldn’t even consider Cloud Computing as a valid option. But honestly, from what I can see, I will likely be running a couple more Cloud applications out there, rather than keep hosting my own websites and running my own servers. It’s just more practical, and it’s a different trade-off between costs and time spent maintaining things, so I’m okay with it going this way. But I also want to make sure I don’t end up locking myself into a single provider, with no chance of migrating.

Planets, Clouds, Python

Half a year ago, I wrote some thoughts about writing a cloud-native feed aggregator. I have actually started drawing up some ideas of how I would design this myself since then, and I even went through the (limited) trouble of having it approved for release. But I have not actually released any code, or to be honest, I have not written any code either. The repository has been sitting idle.

Now, with the Python 2 demise coming soon, and me not being interested in keeping a server around nearly exclusively to run Planet Multimedia, I started looking into this again. The first thing I realized is that I both want to reuse as much existing code as I can, and want to integrate with “modern” professional technologies such as OpenTelemetry, which I appreciate from work, even if it sounds like overkill.

But that’s where things get complicated: while going full “left-pad” and having a module for literally everything is not something you’ll find me happy about, a quick look at feedparser, probably the most common module to read feeds in Python, shows just how much code is spent trying to cover for old Python versions (before 2.7, even), or to implement minimal viable interfaces to avoid mandatory dependencies altogether.

Thankfully, as Samuel from NewsBlur pointed out, it’s relatively trivial to just fetch the feed with requests, and then pass it down to feedparser. And since there are integration points for OpenTelemetry and requests, having an instrumented feed fetcher shouldn’t be too hard. That’s going to probably be my first focus when writing Tanuga, next weekend.
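
The fetch side really is that small. A minimal sketch of what I have in mind for Tanuga’s fetcher, assuming the opentelemetry-instrumentation-requests package for the tracing part, looks roughly like this:

import feedparser
import requests
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Every outgoing HTTP request made through `requests` gets its own span.
RequestsInstrumentor().instrument()

def fetch_feed(url):
    # requests handles the HTTP side (timeouts, redirects, compression);
    # feedparser only has to parse the bytes it is handed.
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return feedparser.parse(response.content)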

Speaking of NewsBlur, the chat with Samuel also made me realize how much of it is still tied to Python 2. Since I’ve gathered quite a bit of experience in porting to Python 3 at work, I’m trying to find some personal time to contribute smaller fixes to run this in Python 3. The biggest hurdle I’m having right now is to set it up on a VM so that I can start it up in Python 2 to begin with.

Why am I back looking at this pseudo-actively? Well, the main reason is that rawdog is still using Python 2, and that is going to be a major pain with security next year. But it’s also the last non-static website that I run on my own infrastructure, and I really would love to get rid of it entirely. Once I do that, I can at least stop running my own (dedicated or virtual) servers. And that’s going to save me time (and money, but time is the most important one here too).

My hope is that once I find a good way to migrate Planet Multimedia to a Cloud solution, I can move the remaining static websites to other solutions, likely Netlify like I did for my photography page. And after that, I can stop the last remaining server, and be done with sysadmin work outside of my flat. Because honestly, it’s not worth my time to run all of these.

I can already hear a few folks complaining with the usual remarks of “it’s someone else’s computer!” — but the answer is that yes, it’s someone else’s computer, but a computer of someone who’s paid to do a good job with it. This is possibly the only way for me to manage to cut away some time to work on more Open Source software.

“Planets” in the World of Cloud

As I have written recently, I’m trying to reduce the number of servers I directly manage, as it’s getting annoying and, honestly, out of touch with what my peers are doing right now. I already hired another company to run the blog for me, although I do keep access to all its information at hand and can migrate it where needed. I also gave Firebase Hosting a try for my tiny photography page, to see if it would be feasible to replace my homepage with that.

But one of the things I still definitely need a server for is to keep running Planet Multimedia, despite its tiny userbase and dwindling content (if you work in FLOSS multimedia, and you want to be added to the Planet, drop me an email!)

Right now, the Planet is maintained through rawdog, which is a Python script that works locally with no database. This is great to run on a vserver, but in a world where most of the investments and improvements go into Cloud services, it’s not really viable as an option. And to be honest, the fact that this is still using Python 2 worries me quite a bit, particularly when the author insists that Python 3 is a different language (it isn’t).

So, I’m now in the market to replace the Planet Multimedia backend with something that is “Cloud native” — that is, designed to be run on some cloud, and possibly lightweight. I don’t really want to start dealing with Kubernetes, running my own PostgreSQL instances, or setting up Apache. I really would like something that looks more like the redirector I blogged about before, or like the stuff I deal with for a living at work. Because it is 2019.

So, sketching this “on paper” very roughly, I expect such software to be along the lines of a single binary with a configuration file, which outputs static files that are served by the web server. Kind of like rawdog, but long-running. Changing the configuration would require restarting the binary, but that’s acceptable. No database access is really needed, as caching can be maintained at the process level — although that would mean that permanent redirects couldn’t be rewritten into the configuration. So maybe some configuration database would help, but it seems most clouds support some simple unstructured data storage that would solve that particular problem.

From experience at work, I would expect the long-running binary to itself be a webapp, so that you can either inspect (read-only) what’s going on, or make changes to the configuration database with it. And it should probably have independent parallel execution of fetchers for the various feeds, which store the received content into a shared (in-memory only) structure that is used by the generation routine to produce the output files. It may sound like over-engineering the problem, but that’s a bit of a given for me, nowadays.
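
In code, the rough shape I have in mind looks like the following sketch. Every name here is hypothetical: fetch_feed and render_pages stand in for the pieces described above, and feedparser-style result objects are assumed.

from concurrent.futures import ThreadPoolExecutor

def regenerate(feed_urls, fetch_feed, render_pages):
    def safe_fetch(url):
        # A single broken feed should not take the whole generation down.
        try:
            return fetch_feed(url)
        except Exception:
            return None

    # Fetch all configured feeds in parallel; results live in memory only.
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(safe_fetch, feed_urls))

    # Merge the entries of every feed that answered, newest first, and
    # hand the snapshot to the routine that writes the static output.
    entries = [entry for feed in results if feed for entry in feed.entries]
    entries.sort(key=lambda entry: entry.get("published_parsed") or (0,),
                 reverse=True)
    render_pages(entries)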

To be fair, the part that makes me most uneasy of all is authentication, but Identity-Aware Proxy might be a good solution for this. I have not looked into it, but I used something similar at work.

I’m explicitly ignoring the serving-side problem: serving static files is a problem that has mostly been solved, and I think all cloud providers have some service that allows you to do that.

I’m not sure if I will be able to work more on this, rather than just providing a sketched-out idea. If anyone knows of something like this already, or feels like giving a try to building it, I’d be happy to help (employer-permitting, of course). Otherwise, if I find some time to build stuff like this, I’ll try to get it released as open source, to build upon.

Blog Redirects & AppEngine

You may remember that when I announced my move to WordPress, I promised I wouldn’t break any of the old links, particularly as I have kept them working since I started running the blog underneath my home office’s desk, on a Gentoo/FreeBSD box, just shy of thirteen years ago.

This is not a particularly trivial matter, because Typo used at least three different permalink formats (and two different formats for linking to tags and categories), and Hugo used different ones for all of those too. In addition to this, one of the old Planet aggregators I used to be on had a long-standing bug and truncated URLs to a certain length (actually, two certain lengths, as they extended it at some point), and since those ended up indexed by a number of search engines, I ended up maintaining a long mapping between broken URLs and what they were meant to be.

And once I had such a mapping, I ended up also keeping in it the broken links that other people have created towards my blog. And then when I fixed typos in titles and permalinks I also added all of those to the list. And then, …

Oh yeah, and there is the other thing — the original domain of the blog, which I made redirect to the newer one nearly ten years ago.

The end result is that I have kept holding, for nearly ten years, an unwieldy mod_rewrite configuration for Apache, which also prevented me from migrating to any other web server. Migrating to a new hostname when I moved to WordPress was always my plan, if nothing else so as not to have to deal with all those rewrites in the same configuration as the webapp itself.

I had kept, until last week, the same abomination of a configuration, running on the same vserver the blog used to run on. But between stopping relationships with customers (six years ago, when I moved to Dublin), moving the blog out, and removing the website of a friend of mine who decided to run his own WordPress, the amount of work needed to maintain the vserver is no longer commensurate with the results.

While discussing my options with a few colleagues, one idea that came out was to just convert the whole thing to a simple Flask application, and run it somewhere. I ended up wanting to try my employer’s own offerings, and ran it on AppEngine (but the app itself does not use any AppEngine specific API, it’s literally just a Flask app).

This meant having the URL mapping in Python, with a bit of regular expression magic to make sure the URLs for previous blog engines are replaced with WordPress-compatible ones. It also meant that I can have explicit logic for what to re-process and what not to, which was not easily done with Apache (but still possible).

Using an actual programming language instead of Apache configuration also means that I can be a bit smarter in how I process the requests. In particular, before returning the redirect to the requester, I’m now verifying whether the target exists (or rather, whether WordPress returns an OK status for it), and using that to decide whether to return a permanent or temporary redirect. This means that most of the requests to the old URLs will return permanent (308) redirects, and whatever is not found raises a warning I can inspect, to see if I should add more entries to the maps.

(Figure: a very simple UML sequence diagram of the redirector, at a high level.)
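
The core of that logic is small enough to sketch here. The names are illustrative (the real app carries the big URL maps around this), and a HEAD request is just one way to check whether WordPress recognises the target:

import requests
from flask import redirect

WORDPRESS_BASE = "https://example.wordpress.com"  # illustrative target

def redirect_to_wordpress(path):
    target = WORDPRESS_BASE + path
    try:
        # Probe the destination: permanent redirects only for URLs that
        # WordPress actually answers with an OK status.
        response = requests.head(target, allow_redirects=True, timeout=5)
        permanent = response.ok
    except requests.RequestException:
        permanent = False
    # 308 for known-good targets, 302 for anything uncertain.
    return redirect(target, code=308 if permanent else 302)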

The best part of all of this is of course that the AppEngine app is effectively always below the free tier quota marker, and as such has an effectively zero cost. And even if it wasn’t, the fact that it’s a simple Flask application with no dependency on AppEngine itself means I can move it to any other hosting option that I can afford.

The code is quite a mess right now, not generic and fairly loose. It has to work around an annoying Flask issue, and as such it’s not in any state for me to open-source, yet. My plan is to do so as soon as possible, although it might not include the actual URL maps, for the sake of obscurity.

But what is very clear to me from this is that if you want to have a domain whose only task is to redirect to other (static) addresses, like projects hosted off-site, or affiliate links – two things that I have been doing on my primary domain together with the rest of the site, by the way – then the option of using AppEngine and Flask is actually pretty good. You can get that done in a few hours.

Backing up cloud data? Help request.

I’m very fond of backups, after the long series of issues I had before I started doing incremental backups. I still have some backup DVDs around, some of which are almost unreadable, and at least one that is compressed as a xar archive in a format that is no longer supported, especially on 64-bit.

Right now, my backups are all managed through rsnapshot, with a bit of custom scripting over it to make sure that if a host is not online, the previous backup is maintained. This works almost perfectly, if you exclude the problems with restored files and the fact that a rename causes files to double, as rsnapshot does not really apply any data de-duplication (and the fdupes program and the like tend to be… a bit too slow to use on 922GB of data).

But there is one problem that rsnapshot does not really solve: backup of cloud data!

Don’t get me wrong: I do back up the (three) remote servers just fine, but this does not cover the data that is present in remote, “cloud” storage, such as the GitHub, Gitorious and BitBucket repositories, or delicious bookmarks, GMail messages, and so on and so forth.

Cloning the bare repositories and backing those up is relatively trivial: it’s a simple script to write. The problem starts with the less “programmatic” services, such as the aforementioned bookmarks and messages. Especially with GMail, as copying the whole 3GB of data from the server each time is unlikely to work fine; it has to be done properly.
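
The repository part really is the trivial bit. Something along these lines (a sketch; the list of remotes is obviously made up, and in practice it would be generated from the hosting sites’ APIs), run from cron into the rsnapshot-ed area, would do:

import os
import subprocess

REMOTES = [
    # Made-up example entries.
    "git://github.com/example/project.git",
    "git://gitorious.org/example/project.git",
]

def mirror(url, destination):
    # The first run creates a bare mirror with every ref and tag;
    # subsequent runs just update it in place.
    if not os.path.isdir(destination):
        subprocess.run(["git", "clone", "--mirror", url, destination], check=True)
    else:
        subprocess.run(["git", "remote", "update", "--prune"],
                       cwd=destination, check=True)

for remote in REMOTES:
    mirror(remote, os.path.join("/backup/git", os.path.basename(remote)))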

Has anybody any pointer on the matter? Maybe there’s already a smart backup script, similar to tante’s smart pruning script that can take care of copying the messages via IMAP, for instance…