Planets, Clouds, Python

Half a year ago, I wrote some thoughts about writing a cloud-native feed aggregator. I have actually sketched out some ideas of how I would design it since then, and I even went through the (limited) trouble of having it approved for release. But I have not actually released any code; to be honest, I have not written any code either. The repository has been sitting idle.

Now, with the Python 2 demise coming soon, and me not interested in keeping a server around almost exclusively to run Planet Multimedia, I started looking into this again. The first thing I realized is that I want both to reuse as much existing code as I can, and to integrate with “modern” professional technologies such as OpenTelemetry, which I appreciate from work, even if it sounds like overkill.

But that’s where things get complicated: while going full “left-pad”, with a module for literally everything, is not something you’ll find me happy about, a quick look at feedparser, probably the most common module for reading feeds in Python, shows just how much code is spent covering for old Python versions (before 2.7, even), or implementing minimum-viable interfaces to avoid mandatory dependencies altogether.

Thankfully, as Samuel from NewsBlur pointed out, it’s relatively trivial to just fetch the feed with requests, and then pass it down to feedparser. And since there are integration points for OpenTelemetry and requests, having an instrumented feed fetcher shouldn’t be too hard. That’s probably going to be my first focus when writing Tanuga, next weekend.

Speaking of NewsBlur, the chat with Samuel also made me realize how much of it is still tied to Python 2. Since I’ve gathered quite a bit of experience porting to Python 3 at work, I’m trying to find some personal time to contribute smaller fixes towards running it on Python 3. The biggest hurdle I’m facing right now is setting it up on a VM so that I can get it running in Python 2 to begin with.

Why am I back looking at this pseudo-actively? Well, the main reason is that rawdog is still using Python 2, and that is going to be a major pain security-wise next year. But it’s also the last non-static website that I run on my own infrastructure, and I really would love to get rid of it entirely. Once I do that, I can at least stop running my own (dedicated or virtual) servers. And that’s going to save me time (and money, though time is the more important of the two here).

My hope is that once I find a good solution to migrate Planet Multimedia to the cloud, I can move the remaining static websites to other solutions, likely Netlify, like I did for my photography page. And after that, I can stop the last remaining server, and be done with sysadmin work outside of my flat. Because honestly, it’s not worth my time to run all of these.

I can already hear a few folks complaining with the usual remark of “it’s someone else’s computer!” — but the answer is that yes, it’s someone else’s computer, but a computer of someone who’s paid to do a good job with it. This is possibly the only way for me to carve out some time to work on more Open Source software.

“Planets” in the World of Cloud

As I have written recently, I’m trying to reduce the number of servers I directly manage, as it’s getting annoying and, honestly, out of touch with what my peers are doing right now. I already hired another company to run the blog for me, although I do keep all its information at hand and can migrate away where needed. I’m also giving Firebase Hosting a try for my tiny photography page, to see if it would be feasible to replace my homepage with it.

But one of the things I still definitely need a server for is to keep Planet Multimedia running, despite its tiny userbase and dwindling content (if you work in FLOSS multimedia, and you want to be added to the Planet, drop me an email!).

Right now, the Planet is maintained through rawdog, a Python script that works locally with no database. This is great to run on a vserver, but in a world where most of the investment and improvement goes into cloud services, it’s not really viable as an option. And to be honest, the fact that it is still using Python 2 worries me not a little, particularly when the author insists that Python 3 is a different language (it isn’t).

So, I’m now in the market to replace the Planet Multimedia backend with something that is “Cloud native” — that is, designed to be run on some cloud, and possibly lightweight. I don’t really want to start dealing with Kubernetes, running my own PostgreSQL instances, or setting up Apache. I really would like something that looks more like the redirector I blogged about before, or like the stuff I deal with for a living at work. Because it is 2019.

So, sketching this “on paper” very roughly, I expect such a piece of software to be along the lines of a single binary with a configuration file, outputting static files that are served by the web server. Kind of like rawdog, but long-running. Changing the configuration would require restarting the binary, but that’s acceptable. No database access is really needed, as caching can be maintained at the process level — although that would mean that permanent redirects couldn’t be rewritten in the configuration. So maybe some configuration database would help, but it seems most clouds support some simple unstructured data storage that would solve that particular problem.

From experience at work, I would expect the long-running binary to itself be a webapp, so that you can either inspect (read-only) what’s going on, or make changes to the configuration database through it. And it should probably have independent, parallel fetchers for the various feeds, which store the received content into a shared (in-memory only) structure, used by the generation routine to produce the output files. It may sound like over-engineering the problem, but that’s a bit of a given for me, nowadays.
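That design can be sketched, very roughly, in Python; every name below is made up, and the fetch function passed in is a stand-in for the real HTTP/parsing work:

```python
# Rough sketch of the long-running design: independent fetchers feed a
# shared in-memory cache, and a generation step renders output from it.
import concurrent.futures
import threading

class FeedCache:
    # Process-level cache shared between fetchers and the generator.
    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}

    def store(self, feed_url, entries):
        with self._lock:
            self._entries[feed_url] = entries

    def snapshot(self):
        with self._lock:
            return dict(self._entries)

def refresh_all(feed_urls, fetch, cache):
    # Fetch every feed in parallel; `fetch` is whatever actually talks
    # HTTP (requests + feedparser in a real implementation).
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        futures = {pool.submit(fetch, url): url for url in feed_urls}
        for future in concurrent.futures.as_completed(futures):
            url = futures[future]
            try:
                cache.store(url, future.result())
            except Exception:
                pass  # a failed fetch keeps the previous cached copy

def generate(cache):
    # Stand-in for the static-output step: flatten all cached entries.
    items = []
    for url, entries in sorted(cache.snapshot().items()):
        items.extend(entries)
    return items
```

Note that a failed fetch simply leaves the previous cached copy in place, which matches the behaviour I want from the Planet: one broken feed should not blank out its author’s posts.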

To be fair, the part that makes me most uneasy of all is authentication, but Identity-Aware Proxy might be a good solution for this. I have not looked into it yet, but I have used something similar at work.

I’m explicitly ignoring the serving-side problem: serving static files is a problem that has mostly been solved, and I think all cloud providers have some service that allows you to do that.

I’m not sure if I will be able to work more on this, rather than just providing a sketched-out idea. If anyone knows of something like this already, or feels like giving building this a try, I’d be happy to help (employer-permitting, of course). Otherwise, if I find some time to build stuff like this, I’ll try to get it released as open source, to build upon.

Backing up cloud data? Help request.

I’m very fond of backups, after the long series of issues I had before I started doing incremental backups. I still have some backup DVDs around, some of which are almost unreadable, and at least one that is compressed with xar in a format that is no longer supported, especially on 64-bit systems.

Right now, my backups are all managed through rsnapshot, with a few custom scripts on top to make sure that if a host is not online, the previous backup is maintained. This works almost perfectly, if you exclude the problems with restored files and the fact that a rename causes files to be duplicated, as rsnapshot does not really apply any data de-duplication (and fdupes and similar programs tend to be a bit too slow to use on 922GB of data).
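The custom scripts in question amount to little more than a reachability check before letting rsnapshot rotate. A hypothetical Python version could look like this — host names, per-host config paths, and the Linux-style ping timeout flag are all my assumptions:

```python
# Hypothetical wrapper: only run rsnapshot for hosts that answer a
# ping, so an offline host keeps its previous snapshot instead of
# rotating in an empty one.
import subprocess
import sys

def host_is_up(host):
    # One ping with a two-second timeout; exit code 0 means reachable.
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "2", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:  # no ping binary in $PATH
        return False
    return result.returncode == 0

def backup(host, interval="daily"):
    if not host_is_up(host):
        print(f"{host} unreachable, keeping previous backup",
              file=sys.stderr)
        return False
    # Per-host configuration path is an assumption for illustration.
    subprocess.run(
        ["rsnapshot", "-c", f"/etc/rsnapshot-{host}.conf", interval],
        check=True,
    )
    return True
```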

But there is one problem that rsnapshot does not really solve: backup of cloud data!

Don’t get me wrong: I do back up the (three) remote servers just fine, but this does not cover the data that is present in remote, “cloud” storage, such as my GitHub, Gitorious and BitBucket repositories, or my Delicious bookmarks, GMail messages, and so on.

Cloning the bare repositories and backing those up is relatively trivial: it’s a simple script to write. The problem starts with the less “programmatic” services, such as the aforementioned bookmarks and messages. Especially with GMail: copying the whole 3GB of data from the server each time is unlikely to work well, so it has to be done properly.
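For the record, the “simple script” for the repository side can be sketched like this; the directory layout and naming are my own assumptions:

```python
# Mirror each remote repository with `git clone --mirror`, and refresh
# existing mirrors with `git remote update` on later runs.
import os
import subprocess

def mirror_repository(url, backup_dir):
    # Derive a directory name like "project.git" from the URL.
    name = url.rstrip("/").rsplit("/", 1)[-1]
    if not name.endswith(".git"):
        name += ".git"
    target = os.path.join(backup_dir, name)
    if os.path.isdir(target):
        # Mirror already exists: just fetch everything again.
        subprocess.run(["git", "remote", "update"], cwd=target, check=True)
    else:
        subprocess.run(["git", "clone", "--mirror", url, target], check=True)
    return target
```

A bare mirror keeps all branches and tags, so the backup stays complete even for repositories I never check out locally.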

Does anybody have any pointers on the matter? Maybe there’s already a smart backup script, similar to tante’s smart pruning script, that can take care of copying the messages via IMAP, for instance…
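In the absence of such a script, the incremental IMAP approach I have in mind could look roughly like this; the server, folder, and storage callback are placeholders, and only the UID bookkeeping is concrete:

```python
# Sketch of an incremental IMAP backup: remember which message UIDs
# are already on disk, and only fetch the missing ones, so a
# GMail-sized mailbox is not re-copied on every run.
import imaplib

def missing_uids(server_uids, saved_uids):
    # Pure helper: the UIDs we still have to download.
    return sorted(set(server_uids) - set(saved_uids), key=int)

def backup_folder(host, user, password, folder, saved_uids, save):
    # `save` is a caller-supplied callback that persists one message;
    # where and how it stores things is out of scope here.
    conn = imaplib.IMAP4_SSL(host)
    try:
        conn.login(user, password)
        conn.select(folder, readonly=True)
        _, data = conn.uid("SEARCH", None, "ALL")
        server_uids = data[0].split()
        for uid in missing_uids(server_uids, saved_uids):
            _, msg = conn.uid("FETCH", uid, "(RFC822)")
            save(uid, msg[0][1])
    finally:
        conn.logout()
```

One caveat I’m aware of: UIDs are only stable per folder as long as the server’s UIDVALIDITY does not change, so a real script would have to record and check that value too.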