You may remember that two years ago I decided to self-host WordPress on Gandi after Automattic made it impossible for me to write blog posts on my normal Windows setup. While that setup mostly worked fine over the past couple of years, I kept hitting some of its limitations from time to time, so I eventually decided to move from the half-managed setup with Gandi to a fully self-hosted setup on Hetzner. Since this completed as of this past Friday, I thought I would at least write down my reasons and some notes, as a reference for others and for the future.
The first question I want to answer is: why did I bother moving infrastructure at all? Given my personal dislike of self-hosted solutions, taking on even more personal infrastructure is not exactly consistent of me, so I want to share the reasoning that made me bite the bullet and do something so out of character. Unfortunately I don’t have a simple, one-line slogan to justify it; it’s instead a fairly complex, nuanced situation.
One component of my motivation is that my overall excitement for Gandi has been cooling down since they were acquired. While I’m not rushing out of the door with my domains just because of the acquisition, I have been making plans to ensure I wouldn’t be caught unable to migrate due to a hard dependency on the Gandi system. As it turns out, there’s a couple of domains I can’t easily move away from Gandi, which I’m not particularly happy about, but can live with.
Another motivation is that there has always been an open question for me about the performance of the Simple Hosting solution that Gandi offered. When I migrated WordPress, I went for the cheapest option that Gandi provided, and that for the most part worked perfectly fine for readers of the blog, but made it fairly difficult to upload media and have the thumbnails generated as intended. Unfortunately I could never get a straight answer from Gandi support (even before the acquisition) on whether paying for the larger storage option would also have increased the share of CPU and memory available. And since running this on a Hetzner ARM server (with shared CPU and RAM) was simply cheaper than the Gandi upgrade, while providing more RAM and CPU, that was an obvious alternative.
More to the point of avoiding a hard dependency on the new Gandi, there is no easy way to automatically back up a WordPress install running on their Simple Hosting service. Automattic hides this behind the Jetpack Backup feature, and I have already opined on their offerings. And while you can enable an “emergency console” through the Gandi interface, this requires a human to click buttons, and does not include the ability to copy files via SCP, let alone to take a MySQL backup! Which means that, in case of Gandi crashing and burning, I would have had no viable recourse to maintain continuity for the blog.
To make things more complicated, in the past few weeks the uptime of the blog – which I started recording via Uptime Robot (affiliate link) – has been, well, not bad, but not perfect. It seemed like every few days Gandi had a 15-minute blip (which I had never seen previously), and a couple of weeks ago the blog was still not coming back up after 22 minutes. I could see that the frontend server couldn’t talk to the PHP-FPM runner, but according to Gandi everything was fine. I managed some percussive maintenance by upgrading the PHP version, but the lack of a “restart” button didn’t sit very well with me.
Finally, I’ve been experiencing a number of problems with the Block Editor slowing down when a blog post gets into “long form” territory. While these appear to be client-side, as I don’t see anything in the Chrome network monitor, they persist no matter the operating system (including Android) and browser (including Firefox), which has me wondering whether they are caused by the slow asynchronous requests to save the draft, which would point at a bottleneck on the hosting itself.
Blockers And Preparation
When Gandi was acquired, I had a feeling this time would come, either because the acquisition would lead me to run away from some of their decisions, or because of infrastructure falling apart. Which meant that, back then, I had already started accepting I would have to run my own infrastructure at some point.
At that point I decided not just to decommission my minuscule server at Scaleway (in favour of Hetzner, due to costs), but also to learn to use Docker (which I had barely ever used in the past), Prometheus (which thankfully resembles the old Google internal monitoring system closely enough), and Grafana. With the help of Alex, Srdjan and Luke, I managed to get most of the system set up over a weekend, and had both the blog redirector and IPv6 in Real Life running and served in a jiffy.
This was useful when, a couple of months back, I realized that Netlify could build neither my homepage nor Autotools Mythbuster, since they both used the old-style, now deprecated, Ruby Sass compiler, and the latter couldn’t even run on Netlify’s modern Ubuntu image as it relies on the namespaced DocBook stylesheets. The end result of that is that I’m now building a Docker image for Autotools Mythbuster builds (based on Ubuntu, because Alpine does not package those namespaced stylesheets in the first place), and fetching and updating the site on an hourly schedule.
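For the hourly schedule, a systemd timer is one way to drive it; a minimal sketch, assuming a matching (hypothetical) mythbuster-build.service unit that runs the containerized build and deploys the output:

```ini
# mythbuster-build.timer (hypothetical unit name): pair it with a
# mythbuster-build.service that runs the Docker-based site build.
[Unit]
Description=Hourly rebuild of Autotools Mythbuster

[Timer]
OnCalendar=hourly
# Run a missed build at boot if the server was down at the scheduled time.
Persistent=true

[Install]
WantedBy=timers.target
```

Enabled with `systemctl enable --now mythbuster-build.timer`, this avoids needing a long-running scheduler container just for the rebuild.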
In addition to Grafana, I also needed Prometheus to be able to send me alerts. I decided to set it up with Alertmanager, and have it send me alerts over Telegram, of all places, because their setup for notification bots is just too easy to ignore. Despite three years of working on WhatsApp, I know that setting that up right for notifications is currently not an easy task, even though I would definitely have preferred it.
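The Telegram side really is just a handful of lines in the Alertmanager configuration; a minimal sketch, with the bot token and chat ID as obvious placeholders:

```yaml
# Hypothetical alertmanager.yml fragment: route everything to Telegram.
route:
  receiver: telegram

receivers:
  - name: telegram
    telegram_configs:
      - bot_token: "0000000000:replace-with-your-bot-token"
        chat_id: -1000000000000
```

The bot itself is created by talking to Telegram’s BotFather, and the chat ID is that of the conversation between you and the bot.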
As finding the right alerts for the right metrics, and the right syntax to set them up, turned out to be particularly difficult for me “cold turkey”, this time I decided to pinch my nose and go with an option I’m not particularly happy about… I “asked” ChatGPT for help. It turns out it was a decent starting point to list commonly used alerts for various backends (Redis, PostgreSQL, host-level), since it was trained on a lot of text discussing these. And since the Prometheus documentation excels at being very precise, but in such a way that it’s very dense for humans, using an LLM to get a human-readable output turns out to be… not perfect, but at least usable.
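To give an idea of the kind of output involved, this is the shape of a host-level rule such a session tends to produce; the threshold and labels here are placeholders that need validating against your own metrics:

```yaml
# Hypothetical Prometheus alerting rule; the expression uses standard
# node_exporter metrics, but the 10% threshold is arbitrary.
groups:
  - name: host-alerts
    rules:
      - alert: HostOutOfMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Less than 10% of memory available on {{ $labels.instance }}"
```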
I did find that its usefulness disappears the moment you ask for anything more complicated than “summarise for me what other people have been using to alert on this software,” which is why I’ll stick to the belief that LLMs will not “eat the world” like many people keep thinking, but might annoy enough people in the short term that it would look like they did.
And finally, the last blocker was figuring out how to get stable backups. As I said, using the WordPress native backup feature requires paying for Jetpack Backup, or for one of the silly expensive bundles that include “AI” regurgitation (I can stomach using it as a tool to complete a task, but not the idea of using it to poison the well of knowledge!)
Instead, I turned to my old friend Tarsnap, which I always recommend to anyone needing a safe and reliable Unix backup. Using tarsnap means that, instead of having to worry about exports and their completeness, I’m backing up the whole PHP application as well as the whole database storage. For the size of the blog as it is, we’re still talking about less than $0.01, which is basically a rounding error.
Unfortunately, at first I was at a loss on how to make use of tarsnap on these ARM64 servers: the official path that Tarsnap recommends is building your own binary, but I didn’t want to compile it on the servers themselves, and while I did have a couple of ARM64 machines at hand, none of them runs CentOS Stream 9, which is what I ran on the servers in the first place. Thankfully, Stanislav came to my rescue by showing me how to use COPR, so I could fork the existing tarsnap repository to make tarsnap available for CentOS Stream 9 on ARM64.
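The resulting backup job is then quite small; a sketch, where every path, container name and archive name is hypothetical, and DRY_RUN=1 (the default here) only prints the commands so the script can be inspected without tarsnap or Docker installed:

```shell
#!/bin/sh
# Sketch of a nightly blog backup via tarsnap; paths and names are made up.
# With DRY_RUN=1 the commands are only echoed, never executed.
set -eu

DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" -eq 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

STAMP="$(date +%Y%m%d)"

# Dump the database from inside the MariaDB container into a bind-mounted
# directory, then archive both the dump and the whole PHP application tree
# in a single tarsnap archive.
run docker exec mariadb mariadb-dump --all-databases --result-file=/backup/blog.sql
run tarsnap -c -f "blog-$STAMP" /srv/wordpress /srv/backup/blog.sql
```

Using `--result-file` rather than shell redirection keeps the dump writing inside the container, where the bind mount lives.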
Spoilers: Docker and IPv6
As I already wrote, despite criticizing quite a bit the way enthusiasts paint it, I’m not backward, and I obviously wanted IPv6 working, not just to serve the blog but to access the underlying infrastructure. It turns out there’s a lot of documentation out there scaring you away from using IPv6 with Docker, but while it is definitely not the most obvious out-of-the-box experience, there is nothing that really stops you from using them together.
Indeed, in many (but not all) cases you don’t even need to bother with the experimental opt-in for ip6tables, as the only thing it is required for is letting the backends know the real address of the remote peer (without it, Docker relays TCP connections through its userland proxy, which is slower, but still feasible.) So the fact that we still carry around the meme that you can’t use IPv6 and Docker together is a little bit annoying, even to me.
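As a taste of the eventual post, the no-ip6tables variant amounts to little more than a network definition; a compose sketch, with a made-up ULA subnet:

```yaml
# Hypothetical compose fragment: an IPv6-enabled network without the
# experimental ip6tables option. Inbound IPv6 connections go through
# Docker's userland proxy, so backends see the proxy address rather than
# the real remote peer.
services:
  caddy:
    image: caddy:2
    ports:
      - "80:80"
      - "443:443"
networks:
  default:
    enable_ipv6: true
    ipam:
      config:
        - subnet: fd00:cafe::/64
```

Opting into the ip6tables behaviour instead happens in /etc/docker/daemon.json, where (at the time of writing) it also requires the experimental flag.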
But this is a topic for its own blog post, particularly as I want to make sure to test a few more of the corner cases before I make a clear statement that things work Just Fine™, so for now consider it just a spoiler.
Moving Everything, and Making Sure
Once I knew I would be able to maintain my own infra with minimal oversight (which is not quite true yet, but we can accept the fiction for the time being; I’ll have more posts as I approach a lower-maintenance solution), I decided to bite the bullet, and last Friday I reserved a new ARM64 server, installed CentOS Stream 9, and started re-creating the base system I needed to set up the blog.
The first hurdle was getting the data out of Gandi, though. As I found out, the emergency console access that Simple Hosting gives you does not allow SCP, because you need to press Enter after login for it to work. The way I got around it was a very old, terrible trick: netcat and tar. And since I couldn’t get an open port on the Gandi side, I had to open one on the new server’s side. It’s horribly insecure and I wouldn’t recommend it to anyone, but it worked insofar as getting the data onto the right server.
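The trick boils down to streaming a tar archive over a raw TCP connection; a sketch, where the hostname, port and paths are placeholders, and with the same tar streaming demonstrated locally so it can be tried without the network leg:

```shell
#!/bin/sh
# The netcat-and-tar trick, sketched out. There is no authentication nor
# encryption here, hence "horribly insecure".
set -eu

# On the new (receiving) server, which is the side that can open a port:
#   nc -l 9999 | tar -xpf - -C /srv/wordpress
#
# On the old (sending) server, from the emergency console:
#   tar -cpf - -C /srv/vhosts wordpress | nc new-server.example.net 9999

# The same tar streaming, demonstrated locally without netcat:
SRC="$(mktemp -d)"
DST="$(mktemp -d)"
echo 'hello' > "$SRC/index.php"
tar -cpf - -C "$SRC" . | tar -xpf - -C "$DST"
cat "$DST/index.php"   # prints: hello
```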
I opted for dumping the MySQL database as SQL to load into MariaDB, since Gandi was using the Percona builds, while the easiest Docker option I found was the official mariadb image. The WordPress install was tarred up and moved as a whole.
On the Docker side, I ended up with separate containers:
- a database container with MariaDB;
- a PHP-FPM container based on the WordPress images;
- a Caddy container to provide the user-facing web server (supporting HTTP/3!).
Additionally, of course, I had the usual selection of Prometheus exporters, and eventually added a Redis container for persistent object caching (a feature that was unavailable to me on Gandi!)
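Put together, the stack looks roughly like this compose sketch; the image tags, host paths, and the password are all placeholders:

```yaml
# Hypothetical docker-compose.yml for the blog stack.
services:
  db:
    image: mariadb:10.11
    environment:
      MARIADB_ROOT_PASSWORD: replace-me
    volumes:
      - /srv/mysql:/var/lib/mysql
  redis:
    image: redis:7
  wordpress:
    image: wordpress:php8.2-fpm
    volumes:
      - /srv/wordpress:/var/www/html
  caddy:
    image: caddy:2
    ports:
      - "443:443/tcp"
      - "443:443/udp"   # HTTP/3 runs over QUIC, hence the UDP mapping
    volumes:
      - /srv/wordpress:/var/www/html   # same path as in the FPM container
```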
When I did the transfer, I thought I would test it first as a “beta” site, but that doesn’t seem to be easy: all the links in WordPress are absolute, and would keep pointing at the old pages. So after ensuring it was sort-of working right, I switched the DNS and waited for propagation to happen… and then, magic: load times went down significantly! I’m not sure how much of this is due to a better connection, HTTP/3, or the generally better performance of the server, but it was already a very nice improvement in my view.
Unfortunately, a few hours after doing that, I could tell that something still wasn’t right. From time to time I could see a 500 (Internal Server Error) response being returned even for the main page, and when trying to use the admin interface I would get a database connection error.
This was particularly annoying to debug, because by default you do not get any indication of what is wrong, and when looking at the MariaDB log I could see that no connection issue was detected on its side!
To make it possible to debug things at all, I ended up adding this to wp-config.php:

```php
define( 'WP_DEBUG', true );
define( 'WP_DEBUG_LOG', '/var/log/wordpress.log' );
define( 'WP_DEBUG_DISPLAY', false );
```
This allowed me to see the warnings and errors on failures, not just in the named log file but on the stderr of the php-fpm container. Had I done this first thing, it would have been very easy to figure out, but I didn’t. Instead I was at first grasping in the dark, but had a guess that the problem had to be with the resolution of the database hostname to its IPv6 address, and confirmed it when adding Redis, as that particular connection dropped more often, and would explicitly complain about the resolution of the hostname.
I first tried keeping a depends_on relationship between the containers, but it didn’t help. Then I tried setting the address directly in the /etc/hosts file, which needs to be done via the Docker Compose extra_hosts option, and it didn’t seem to help either. Using the raw IPv6 address in the config file actually seemed to work, and since I already had to make the address static for extra_hosts, I could at least rely on that. It worked.
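In compose terms, the workaround amounts to pinning the addresses; a sketch, with placeholder ULA addresses:

```yaml
# Hypothetical fragment: give the database a static IPv6 address, so that
# wp-config.php can use the raw address, with extra_hosts kept for anything
# that still insists on resolving the name.
services:
  db:
    image: mariadb:10.11
    networks:
      default:
        ipv6_address: fd00:cafe::10
  wordpress:
    image: wordpress:php8.2-fpm
    extra_hosts:
      - "db:fd00:cafe::10"
networks:
  default:
    enable_ipv6: true
    ipam:
      config:
        - subnet: fd00:cafe::/64
```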
I complained about this on the Fediverse, and Alessandro Lai came to my help, suggesting this could be a problem with the way Alpine resolves names, and pointing at a known issue. Unfortunately, the suggestion of using a trailing dot to ensure the hostname is considered absolute did not work either, and a resolution error keeps cropping up from time to time when using a hostname rather than the raw IP address.
I think the best I can do here is to reduce this to a test case I can run in a “lab VM” and then start debugging why getaddrinfo appears to fail. For now, the static bare IP address works for me, and I’m not going to spend too much time on this.
Spoiler: PHP, Docker, and Caddy
Another topic that will likely feature on the blog in the upcoming weeks is how to correctly set up Caddy and PHP-FPM to work in Docker. I saw tons of documentation suggesting Nginx, but I’m not quite keen on using Nginx again, and after all I already know how to run Caddy in a Docker container.
I think the important thing to understand is that if you want to forward requests from Caddy to FPM, you need to mount the PHP root at the same path within both containers, since FastCGI expects the handover to happen within the same filesystem (d’oh!)
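A minimal Caddyfile along those lines; the hostname and upstream name are placeholders, and /var/www/html is assumed to be the shared mount point in both containers:

```
# Hypothetical Caddyfile: the root must be the path at which the same
# files are mounted in the php-fpm container, since FastCGI hands over a
# filesystem path, not the file contents.
blog.example.net {
	root * /var/www/html
	php_fastcgi wordpress:9000
	file_server
}
```

The php_fastcgi directive takes care of the usual fastcgi_params boilerplate that an Nginx setup would require spelling out.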
Again, I’m going to write about this in the future, because I don’t want to throw half-baked tests at the world rather than decently tested documentation. I’m also likely going to add monitoring, which I’m currently missing.
Switching to a completely self-hosted, and more powerful, server didn’t solve the problem with the Block Editor’s delay when typing longer blog posts, suggesting this is a problem with the actual Gutenberg code, and something that I’m unlikely to be able to debug myself, or even pay Automattic to debug and fix for me.
On the other hand, I feel a lot more comfortable about the safety of my blog now, both in terms of infra reliability and in terms of keeping a backup I can actually recover from.
It is possible I will regret the amount of time I’ll spend on infrastructure maintenance, and decide to go back to looking for someone to do this for me. After all, as I keep saying, the bakery is someone else’s oven, and while home-baked bread is nice, after a long week of work it’s just as fine to buy a pack of sliced bread at the store to have at breakfast.
For now, it appears the move went well, and even the annoying scraping from some Canadian OVH server that requested 32k URLs from the blog right after it changed IP didn’t cause a blip on my monitoring. And this is all for £1 more than I was paying Gandi.