Please note, this blog post talks about concerns that are mostly relevant to big companies (mainly Big Tech), rather than to smaller organizations or FLOSS projects. While some of the issues I’m going to be talking about have parallels in those settings, you would probably be left wondering what it is that I’m talking about if you have never experienced those massively bigger realities. I sometimes envy you.
People are busy. It’s a truism, but if you work supporting other teams, you know it first-hand: you are busy, and the teams you support are busy. The things you want to happen are not happening, because everyone is prioritizing something, and their priority might not align with yours. That holds even when their priority is not simply the one set by performance management, the need to make sure they stay employed, and possibly get promoted.
A common suggestion for dealing with this is to make it very clear, when you ask another team to prioritize the work that unblocks you, that you’re both going to win from it. This might sound obvious, but big enough companies sometimes have teams that are at odds with each other, and a win for one can be perceived as a loss for the other. Sorting all of that out goes well beyond my abilities, and beyond the topic of this post.
But there is one particular type of request (or, in management speak, “ask”) that can be generalized a little more than that: migrations. Migrations take all shapes, but a few fixed points hold: they are reactive work, triggered by one or more centralized teams requesting changes from the teams using their product or infrastructure; they require understanding how each internal customer uses said product or infrastructure; and they carry a significant amount of risk to products or infrastructure in active use, which makes execution considerably more complicated.
A note here about what I call frivolous migrations. As noted above, a migration is a complex request to a number of customer teams. Particularly when the migration affects a widely used product, forcing it through, no matter how well planned, is a risky endeavour. If the migration does not bring substantial upside for the risk it entails, higher technical leadership should get involved to have it reconsidered.
On a larger scale, you can see how the IPv6 migration is such a stalled migration: there is little upside to migrating, and significant risk in executing it. It is tying up a lot of resources (across the world, rather than within a single organization) supporting both the state before and the state after the migration, with no real path forward visible yet. Note that this applies to the global migration; within specific organizations, migrating might actually solve problems and reduce resource constraints. It’s only at the global level that it is a problem!
I have also written before about a similar type of frivolous migration: rewriting software in a different language for the sake of the language choice itself. When there is no clearly defined benefit to the rewrite, you would expect more senior people to ask the hard questions, to avoid wasting effort not just from the team doing the rewrite, but also from the internal customers who have to deal with it: either because of newly introduced bugs, or because of behaviour differences.
Leaving aside these frivolous migrations, there are plenty of good reasons to migrate between an older and a more modern system. You may be replacing a more complex system with a simpler, more reliable one. Or you may be replacing an inflexible system written early in the days of an organization with something that has a more dynamic architecture. Sometimes, the migration is forced by external factors, including regulatory compliance, costs, or other business requirements.
Knowing that so many migrations might be needed, sometimes even at the same time and with tight schedules, is where the concept of “attention budget” piqued my interest. The idea is that people can only juggle so many requests at once, and every migration that comes their way needs to fit into this “attention budget.” Even if a migration is, from your point of view, a minimal amount of work, asking a team to add it to their list of priorities is a significant request, and eventually your internal customers will start pushing back against it.
So, what are my recommendations for dealing with a necessary migration? The first is to make it as invisible to the customers as possible, either by making sure that the migration introduces no semantic changes (for instance, to an API), or by making the changes for the customers yourself.
In practice, this means that these migrations should be announced well in advance, with a defined plan, to make sure they don’t end up stopping everyone in their tracks. They need to happen in waves: early waves for testing, open to anyone who wants to opt in, and a final “laggards” wave for customers that are too critical, too overworked, or about to be decommissioned. These waves should never go 0%→100%!
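To make waves like these predictable and auditable, one approach is to assign services to waves deterministically. This is a minimal sketch, not a prescription: the wave names, the fractions, and the service names are all made up for illustration, and a real rollout system would also honor explicit opt-ins and opt-outs.

```python
import hashlib

# Hypothetical wave definitions; the names and fractions are invented
# for illustration and must sum to 1.0.
WAVES = [
    ("early-testing", 0.05),  # small slice for early validation
    ("general", 0.80),        # bulk of the fleet
    ("laggards", 0.15),       # critical, overworked, or soon-gone services
]

def wave_for(service_name: str) -> str:
    """Map a service name to a stable bucket in [0, 1), then to a wave.

    Hashing makes the assignment deterministic: re-running the tool
    always places the same service in the same wave.
    """
    digest = hashlib.sha256(service_name.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for wave, fraction in WAVES:
        cumulative += fraction
        if bucket < cumulative:
            return wave
    return WAVES[-1][0]  # guard against floating-point rounding
```

Because the bucketing is stable, a service never silently jumps between waves as the fleet grows, which keeps the published plan trustworthy.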
Personally, the most annoying migration I worked on was one where the only way to enroll in complete early-wave testing was to already be running your service in one specific region. The migration required at least three different moving pieces to be tested together, but only one region could enable all three; if your service had no footprint in that region, you had nowhere to test all of the components fitting together.
The other recommendation I have is to make it very clear to the customers what they do or do not need to care about in a migration: dashboards, notices, and automated verification tools are my usual go-tos for this work. If your migration notice tells the customers “If you’re relying on feature X from product Y, you will be required to migrate to product Z by end of next quarter”, the first question the team will have is “Do we use feature X?” and if you don’t have an easy way for them to know, you’re overloading both them and yourself.
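For a Python codebase, an automated check for “do we use feature X?” can be as small as an AST scan. This is a toy sketch under loud assumptions: `product_y` and `feature_x` are made-up names standing in for whatever product and feature the notice is about.

```python
import ast
from pathlib import Path

# Hypothetical (module, attribute) pairs the migration notice cares about.
DEPRECATED = {("product_y", "feature_x")}

def find_deprecated_calls(source: str, filename: str = "<memory>") -> list:
    """Return 'file:line' locations of calls like product_y.feature_x(...)."""
    hits = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and isinstance(node.func.value, ast.Name)
                and (node.func.value.id, node.func.attr) in DEPRECATED):
            hits.append(f"{filename}:{node.lineno}")
    return hits

def scan_tree(root: Path) -> list:
    """Scan every .py file under root for deprecated calls."""
    hits = []
    for path in root.rglob("*.py"):
        hits.extend(find_deprecated_calls(path.read_text(), str(path)))
    return hits
```

A scan like this only catches the direct call pattern, of course; the point is that the customer team gets a yes/no answer with file and line numbers, instead of having to grep on their own.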
For this to be possible at all, you need a way to detect the boundaries of a product. This is far from a universal, clear-cut option: it is rare for a team to look after a single service, and even within that service, rare for it to be composed of a single job. Even in the world of monoliths, you would expect at least some composition of web servers and databases, and in the world of Big Tech’s microservices, things get a lot more complicated. These services then need some sort of ownership metadata to map them to responsible teams: usually this means attaching them to oncall rotations of some sort, but it might also just be a matter of tagging services together or giving them a common group name.
As organizations grow and proper access controls become a bigger concern, it is entirely possible that you, as the infrastructure provider, have no access to tell whether another team’s usage of your product is in need of migration or not. In these cases, you should consider building tools that can be pointed at a service, codebase, or datastore to tell what is and is not in use.
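One shape such a self-service tool can take is a check that only reads a team’s dependency manifest, so the provider never needs access to the code itself. This is a sketch, assuming a pip-style requirements file; `legacy-client` is a hypothetical package name standing in for whatever is being decommissioned.

```python
import re

# Hypothetical set of packages slated for removal.
NEEDS_MIGRATION = {"legacy-client"}

def migration_needed(requirements_text: str) -> set:
    """Return which to-be-removed packages appear in a requirements file."""
    found = set()
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        # The package name ends at the first version-specifier character.
        name = re.split(r"[<>=!~\[;]", line, maxsplit=1)[0].strip().lower()
        if name in NEEDS_MIGRATION:
            found.add(name)
    return found
```

Because the check runs on the customer’s side against an artifact they already control, it respects access boundaries while still giving them, and you, a concrete answer.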
This is where my recommendations for migrations clash a bit with what many software engineers believe: much (if not most) of the time, migrations are done to decommission an old codebase, one that has accrued too much technical debt to be maintained, or that turns out to be too expensive to keep running. Making changes to the “old” system is considered unglamorous by many, and is seen by some junior people as a career dead end. I don’t agree with this: particularly when putting together a migration plan, working on the old codebase to increase visibility is extremely important. You want to know who is using the features you’re about to get rid of, whether via logs, counters, or other observability tools.
If a system was designed with observability built in from the beginning, this is obviously easy. If it wasn’t, it might require some engineering work just to be able to tell who is using which feature. Some of my worst migration nightmares involved a new piece of software that was built to migrate semi-transparently, but in doing so completely removed the ability to tell whether a particular component had or had not been migrated. Take a leaf out of Python’s book: explicit is better than implicit.
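Retrofitting that kind of visibility onto an old codebase can be cheap. Here is a minimal sketch of explicit usage tracking via a decorator; the in-memory `Counter` stands in for whatever real metrics system you have, and `feature_x` is a made-up name for a feature slated for removal.

```python
import logging
from collections import Counter
from functools import wraps

log = logging.getLogger("deprecation")
usage = Counter()  # stand-in for a real metrics/monitoring backend

def track_deprecated(feature: str):
    """Count and log every call to a feature slated for removal."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            usage[feature] += 1
            log.warning("deprecated feature %r used", feature)
            return func(*args, **kwargs)
        return wrapper
    return decorator

@track_deprecated("feature_x")
def feature_x(value):
    """An old API kept alive until the last caller migrates away."""
    return value * 2
```

Once the counter for a feature stays at zero for long enough, you have evidence, rather than hope, that it can be removed.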
Speaking of Python, when I left my previous bubble, the migration from Python 2 to Python 3 was still in full swing. This migration was mostly required by external factors (Python 2 being EOL’d), and it ended up tied together with a number of other work items. Getting some foundational core libraries ported to Python 3 was originally considered a waste of resources, particularly as many of them had new, more modern implementations that wouldn’t need to be ported.
Unfortunately, it turned out that these new core libraries had significant semantic differences compared to the old ones, and moving users across was a difficult task that required a non-negligible amount of engineering. So eventually, the whole migration effort came unblocked when those old core libraries were ported to Python 3 (mostly, but not only, by me), despite them being obsolete and deprecated.
In my experience, the best tools are those that can both report on the state of the migration and act on those reports. Depending on what the migration is about, having some kind of flag or switch that can be programmatically changed (and possibly rolled back) is the best experience for the customers. When programmatically automating the migration is not an option, a good alternative is to at least provide copy-paste friendly commands or links.
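Such a programmatic, reversible switch can be very simple. This is a minimal sketch under stated assumptions: `MigrationFlag`, its fields, and the flag name are all invented here, and a real system would persist the flag and its history rather than keep them in memory.

```python
from dataclasses import dataclass, field

@dataclass
class MigrationFlag:
    """A per-service migration switch with an audit trail of changes."""
    name: str
    enabled: bool = False
    history: list = field(default_factory=list)

    def enable(self, actor: str) -> None:
        """Flip the flag on, remembering who did it and the prior state."""
        self.history.append((actor, self.enabled))
        self.enabled = True

    def rollback(self) -> None:
        """Restore the state recorded before the most recent change."""
        if self.history:
            _, previous = self.history.pop()
            self.enabled = previous
```

Keeping the previous state alongside each change is what makes rollback a one-liner for the customer, instead of a support ticket.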
In the Python 3 migration I mentioned above, I eventually built a script that not only identified blockers to the migration (such as usage of still-unported libraries) but also applied a number of transformations to the code, making it Python 3-only. This was an intentional feature: once all of the downstream users of a library were not just Python 3 compatible but, more importantly, no longer compatible with Python 2, the library itself didn’t need to maintain compatibility!
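To make the shape of such a tool concrete, here is a toy sketch of its two halves: reporting blockers, and applying a mechanical transformation (dropping `from __future__` imports, which become redundant once Python 2 support is abandoned). This is not the actual script; `oldlib` is an invented name for an unported library.

```python
import re

# Hypothetical set of libraries that have not yet been ported to Python 3.
UNPORTED = {"oldlib"}

def find_blockers(source: str) -> set:
    """Report imports of libraries that would block a Python 3-only switch."""
    return {
        m.group(1) for m in
        re.finditer(r"^\s*(?:import|from)\s+([\w.]+)", source, re.M)
        if m.group(1).split(".")[0] in UNPORTED
    }

def make_py3_only(source: str) -> str:
    """Drop `from __future__ import ...` lines, redundant on Python 3."""
    lines = [line for line in source.splitlines(keepends=True)
             if not line.lstrip().startswith("from __future__ import")]
    return "".join(lines)
```

The real transformations in such a tool are best done on a proper syntax tree rather than with regular expressions, but the division of labor (first report, then rewrite) is the part worth copying.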
This does mean that designing a migration takes a lot more work than building the new system to replace the old one. The team pushing for such a migration should, in my opinion, consider how much cost they are about to spread across how many teams, and plan to invest the same amount of resources in making the migration cheaper and faster for their internal customers: this not only reduces the risk of disrupting the priorities of the rest of the organization, but also builds reliability keystones that make further migrations cheaper in the future.
Once again, this is just my experience, but I have found that tools that might appear to be built for the sake of one single large-scale migration are easily reused as the basis for the next, and that if the originating team spends time understanding where their customers’ pain points with the migration are going to be, the new system ends up being more solid.