Topically, given what the past couple of years gave us, it appears that sliced bread was invented in the early twentieth century for the purpose of increasing hygiene, and has thus been referred to as «untouched by human hands.» This is a very good analogy for the various “Zero Touch Production” initiatives that are often talked about at big tech companies, and of which I have been part at my previous employer, too.
Indeed, the title of this post should really be the sub-title of any one of those vision documents, because among other things, it provides a clear statement of what the “zero touch” refers to. But let me take a few steps back and explain myself in order, as otherwise it’s just going to get messy — and as I’ll show, I don’t like messy!
Randall Munroe Is Wrong (About Automation)
I said this before, and to be honest it’s not even my line (I don’t know if it came from Niall or Todd, but they did have it in one of their presentations at LISA a few years back in Boston), but I think it’s worth reiterating: Randall Munroe’s multiple comics about automation and related topics are fairly short-sighted with respect to automation in general. In part, this is because his content refers to the experience of an individual tinkerer, so it is not really applicable to the more complex realities of businesses and big tech companies. But he also seems to focus almost exclusively on the time cost of automation.
The problem with focusing on the performance and time of an automation is that not all time is the same, despite it looking the same from the outside. Time that a human needs to spend focused on a task to complete it is a lot more expensive than the time a computer can spend working on the same task. This is similar to what Alec says about dishwashers: yes, a dishwasher may take more clock time overall to clean dishes than if you just cleaned them yourself, but you can be watching a movie, or typing a blog post, while the dishwasher is running! Actually no, don’t watch a movie while it’s running, because then most likely the noise from the dishwasher is going to make it harder to enjoy the sound effects.
Take for instance release processes: manual release processes often have to deal with manually checking out source code, running test suites, ensuring that the generated artefacts are appropriate and still run, then eventually publishing somewhere. Often these processes rely on human eyes to catch size or behaviour changes. And builds may be boosted by having a locally checked out, up-to-date source repository with maybe partial build artefacts. If this process takes, say, three hours, once a month, from a perspective of time alone it might not be worth spending time automating it, particularly if it so often becomes a detour to fix issues with the involved documentation, code, or tools.
But one of the first items when automating a release process is to make sure that it can always run fresh. That means checking out the source code separately, and running it with no partially built artefacts. Even Autotools do that, when you use the make distcheck automation! That can take time, particularly depending on the size of the source code. The three hours a month can then turn into four, or six, or even more! But if it’s automated, this is time that you don’t need to spend paying attention to the release process: it can run overnight, during the weekend, or while you’re having those pesky meetings that you would have preferred as email, but are the expected cost of dealing with messy humans. Computer time is cheaper than human time – not free, obviously: among other things it means running computers, but also still involves waiting – so if you can reduce the human time spent on the process by letting a computer do it more slowly, it’s usually a win.
But it goes on: as I said in my post about pre-commit and code validation, automation can be consistent, which humans are often not. Doing a clean fresh build is important, and I would even say fundamental, but why stop at the source code and artefacts? In a time of easy containers, there’s a reason why a lot of release automation just starts from a clean slate of a new distro image, and sets everything up from there: it no longer relies on the customization that the releasing person applied to their setup, removing (or at least reducing) the risk that a local environment change causes a difference in the released code. Consistency!
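To make the "clean slate" idea a bit more concrete, here is a minimal sketch in Python. It is not a container and not anybody's real release tooling: it only approximates the idea by running every step in a brand new directory with a stripped-down environment. The function name run_fresh and the keep_env parameter are made up for this illustration, and it assumes a Unix-like system.

```python
import os
import shutil
import subprocess
import tempfile


def run_fresh(steps, keep_env=("PATH",)):
    """Run each command in a brand new directory with a minimal environment.

    Returns (ok, failed_step); failed_step is None when every step succeeded.
    """
    workdir = tempfile.mkdtemp(prefix="release-")
    # Only pass through explicitly listed variables, so the operator's
    # shell customisation cannot leak into the build.
    env = {name: os.environ[name] for name in keep_env if name in os.environ}
    # Pin locale and timezone so two operators get the same behaviour.
    env["LC_ALL"] = "C"
    env["TZ"] = "UTC"
    try:
        for step in steps:
            if subprocess.run(step, cwd=workdir, env=env).returncode != 0:
                return False, step
        return True, None
    finally:
        # Nothing survives between runs: no stale checkout, no partial artefacts.
        shutil.rmtree(workdir, ignore_errors=True)
```

The steps would be whatever your process needs, starting with a fresh checkout and ending with the build, each given as an argument list. A real setup would more likely start from a new container image, which also resets the installed tools and not just the working directory.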
Automation Is Your Gym
I have to say I feel a bit out of place in this analogy given that I’m the last person to talk about gyms, as the only ones I visit regularly are the Pokémon Go ones. But if you compare the work of setting up automation to the process of working out at a gym, you may get a view closer to mine and that of many other Release, Site Reliability, or Production Engineers.
You don’t go to the gym to train because it will save you time: you go because you want to be fit, so that you can keep doing… something. Lifting, running, walking, … it doesn’t really matter. A teammate and good friend once said that he goes to the gym to lift because he wants to still be capable of carrying his own luggage when he is retired and travelling across the world. Automation is the same: you don’t write automation to run it once in a blue moon, only to find out that half of the source is broken and the tools changed their flags. You run it all the time, so that when you do need to make a release, you just need to pick the last one that succeeded… you already have the artefacts you need!
That’s why you often hear the motto “Automate what hurts”: if your release is automated, and you run it daily, you’ll know immediately if something broke, either in the environment or in the source itself. You won’t find that out when you’re meant to actually cut the release, scrambling because the usual three-hour process is now a twelve-hour fixathon.
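As a rough sketch of "pick the last one that succeeded", assume each daily run leaves behind a small record. The Candidate type and both function names are hypothetical, invented for this example; a real build system would have its own way of storing results.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    built_on: date
    revision: str
    succeeded: bool


def latest_good(candidates: list[Candidate]) -> Optional[Candidate]:
    """Most recent candidate that built successfully, or None if there is none."""
    good = [c for c in candidates if c.succeeded]
    return max(good, key=lambda c: c.built_on, default=None)


def broken_since(candidates: list[Candidate]) -> Optional[date]:
    """Date of the first failure after the latest good build, or None if healthy."""
    good = latest_good(candidates)
    later_failures = [
        c.built_on
        for c in candidates
        if not c.succeeded and (good is None or c.built_on > good.built_on)
    ]
    return min(later_failures, default=None)
```

With daily runs, broken_since gives you at most a day of changes to look through, rather than a month of them on release day.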
Of course, just like at the gym you should be careful not to overdo things: if you were to automate your deployments, and at each deployment all of your customers lose their progress on orders, you will notice quickly… but you will also have a lot of disappointed users who might just decide that your store is not their choice anyway. Or maybe you deployed your game servers last week, and people just stopped trying to beat the boss that you set as immortal.
Just like my PE teacher taught us in junior high to check our heart rate after exercise, automation needs monitoring, to make sure that things are not getting worse release after release. Again, this is easier to establish if you have frequent releases and deployments: a steady yet subtle change in a metric is easier to notice across frequent releases that shouldn’t have significant changes in them than across significant releases that might bring a completely revamped UI as well as significant changes in functionality.
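One way to look for a "steady yet subtle" change is to compare the latest release not only with the previous one, but also with one several releases back. The sketch below assumes a single metric where higher is worse and values are positive (say, a latency in milliseconds); the function names and the numbers are illustrative only.

```python
def largest_step(values):
    """Largest relative increase between two consecutive releases.

    Expects at least two positive values, oldest first.
    """
    return max((b - a) / a for a, b in zip(values, values[1:]))


def cumulative_drift(values, window):
    """Relative change between the latest value and the one `window` releases earlier.

    Returns None when there is not enough history yet.
    """
    if len(values) <= window:
        return None
    baseline = values[-1 - window]
    return (values[-1] - baseline) / baseline


# Hypothetical latency per release, oldest first: no single step looks
# alarming, but over five releases the metric has crept up by 10%.
latency_ms = [100.0, 102.0, 104.0, 106.0, 108.0, 110.0]
```

A per-release threshold of 5% would stay quiet on this history, while the five-release comparison would not. That is only possible because the releases are frequent and small enough to be comparable.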
Also, shameless plug for you to go ahead and refer to my Testing is Like Onions post for further details on this topic.
So, instead of making one release every month, you may end up building a release candidate every day, or at least once a week. And maybe you can start promoting one every other week, until you build the confidence (and the validation tools) to just run it whenever you need to, or whenever you can.
Back To Sliced Bread — I mean Zero Touch Production
With the automation context out of the way, let me clear up the first misunderstanding that I have seen before. Zero Touch in the context of Zero Touch Production does not mean magic. The misunderstanding happens because we have “one-touch provision” or “one-touch deployment” concepts that are also often related to automation, and the “Zero Touch” makes it sound like we’re trying to have them happen without the user even asking. That’s not it.
The “Zero Touch” part is closer to the original reference to sliced bread: have a production that is untouched by human hands. That basically means that you should not be able to go and make changes to production in your own way. It means that for each possible change you’re expected to be able to make, there’s a process that does it for you, following rules and processes that would be too tedious for a human to stick to, and that are too easily committed to memory. Memorized processes are bad processes: when the tools, requirements, or even timing change, operators tend not to check whether the process changed, and apply what they remember off the top of their head.
So in a Zero Touch Prod utopia, you may have your testing, release, and deployment pipelines set up so that you just land your changes to the source repository, and the next day (or the Monday after — I have opinions about weekend deploys that are not widely shared) they are deployed to production (or staging, depending on how fast you move.) And the same if you need to move the service into (or out of) a new region: you change the source of truth, and something will make the change for you following the expected process, safeguards, and windows.
The “windows” part is something else I want to reiterate: just because you can deploy multiple times a day, it might not be a good idea to. There may or may not be disruption on a release or deployment, or there might be a better time to run certain processes. Just because we’re talking about automation, it does not mean that the automation needs to be constantly running!
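A deployment window can be as simple as a check that the automation consults before it does anything. This is a sketch under made-up assumptions: the allowed days and hours below are placeholders and not a recommendation, and the window is expressed in UTC.

```python
from datetime import datetime, time, timezone

# Hypothetical policy: Monday to Thursday, 09:00 to 16:00 UTC.
ALLOWED_WEEKDAYS = {0, 1, 2, 3}
WINDOW_START = time(9, 0)
WINDOW_END = time(16, 0)


def in_deploy_window(now: datetime) -> bool:
    """True if `now` falls inside the allowed deployment window."""
    if now.tzinfo is None:
        # Refuse naive timestamps rather than guess the operator's timezone.
        raise ValueError("expected a timezone-aware datetime")
    now = now.astimezone(timezone.utc)
    return now.weekday() in ALLOWED_WEEKDAYS and WINDOW_START <= now.time() < WINDOW_END
```

The pipeline can still build and test around the clock; only the step that touches production needs to wait for the window.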
But why do all of this? What’s the thing with human hands that both Ward Baking Company and I hate so much? Well, hygiene is the keyword here, and it’s not unrelated that a lot of SRE type folks talk about “production hygiene” when it comes to following certain guidelines on the configuration of services into production.
If you let humans run the processes instead of automating them, there’s a chance that the same process run by two humans will have different outcomes. Maybe the tools rely on the timezone settings, and the two are not in the same one (or maybe they are and DST just changed — seriously, stop running production in Pacific Time, you fools!) or maybe the language settings on their machines are different and floating point is formatted with a comma instead of a period (this one happened for real to me when I was working on LScube.)
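Both of those failure modes go away when the tooling never consults the operator's settings in the first place. A minimal Python sketch, with function names made up for this example: timestamps are always rendered in UTC, and numbers are formatted with a presentation type that ignores the process locale.

```python
from datetime import datetime, timezone


def stamp(moment: datetime) -> str:
    """Render an instant as a UTC timestamp, whatever timezone it was created in."""
    if moment.tzinfo is None:
        # A naive datetime would silently pick up the machine's local timezone.
        raise ValueError("expected a timezone-aware datetime")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def fmt_ratio(value: float) -> str:
    """Format a float with a period as the decimal separator.

    The "f" presentation type does not use the process locale, unlike "n".
    """
    return format(value, ".3f")
```

Two operators in different timezones who record the same instant get the same string out of stamp, and fmt_ratio writes a period on every machine.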
There’s also a chance that the same process run by the same human twice doesn’t have the same outcome, because they decide they can skip a couple of steps, as they just did the same process yesterday and the setup is still valid (and maybe it isn’t). Or maybe they decided not to read the docs again, because they remember the steps, and instead forget a subtle one in the middle.
Worse, when the action is not even a process, but a manual “let me tweak the amount of memory reserved for that task until this spike is over” (whether by changing the reservation, or by requesting an “emergency loan” if your system includes that), you now have a production that does not match the source of truth, which might or might not be obvious to notice (not everything can be compared between running state and intended state, particularly if modifying a production config file is persistent).
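Where the running state can be observed, spotting this kind of drift is just a comparison against the source of truth. The sketch below assumes both sides can be flattened into simple key/value mappings with string keys, which is a big assumption; as noted above, not everything can be compared this way. The setting names are invented for the example.

```python
MISSING = "<absent>"


def find_drift(intended: dict, running: dict) -> dict:
    """Map each differing key to a (intended, running) pair of values."""
    drift = {}
    for key in sorted(intended.keys() | running.keys()):
        want = intended.get(key, MISSING)
        have = running.get(key, MISSING)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical example: someone bumped the memory during a spike and
# left a debug flag on, without touching the source of truth.
intended = {"memory_mb": 512, "replicas": 3}
running = {"memory_mb": 2048, "replicas": 3, "debug": True}
```

Run on a schedule, a report like this at least makes the manual tweak visible, so that somebody can either revert it or land it properly.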
All of these “human touches” are liabilities, both for companies and for the individuals involved. For a company, besides the risk of mistakes happening, there’s an additional risk: if the engineers who should be running emergency procedures are too scared of making mistakes, they may just not run the procedures. For the individuals, there’s the issue of increased liability (not monetary, but reputational), as well as the risk that having to check, double-check, and triple-check each step might turn a two-hour process into a six-hour one!
In addition, once you actually reach your goal of zero touch production, you can start safeguarding access, leaving a “break glass in case of emergency” knob to provide that access, similarly to the sudo security model: you wouldn’t want to run your desktop as root, but you can still access all of its power if you need it.
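A toy version of such a knob might look like the following. It is only an illustration of the shape of the idea: access is possible, but it has to be asked for explicitly, with a reason, and it leaves a trail. The names break_glass and AUDIT_LOG are invented here, and a real system would grant and revoke actual credentials rather than append to a list.

```python
from contextlib import contextmanager
from datetime import datetime, timezone

AUDIT_LOG = []


@contextmanager
def break_glass(operator: str, reason: str):
    """Record who opened emergency access, why, and for how long."""
    if not reason.strip():
        raise ValueError("a break-glass reason is required")
    entry = {
        "operator": operator,
        "reason": reason,
        "opened": datetime.now(timezone.utc),
        "closed": None,
    }
    AUDIT_LOG.append(entry)
    try:
        yield entry
    finally:
        # Always close the entry, even if the emergency work raised an error.
        entry["closed"] = datetime.now(timezone.utc)
```

Anything done inside the with block happens under a recorded reason, which is the property that matters for the review afterwards.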
The Journey Is Complex, Meet Your Friends There
One of the things that I want to make very clear is that Zero Touch Production is not something that can be achieved overnight. In reality, it’s mostly an aspiration, like most of the tales in the SRE Book.
I worked on a number of projects over the years that have been labelled more or less successfully as “Zero Touch Production”. For the most part this is a catchy name for projects involved in building automated and reproducible processes. I have even used it as marketing spin explicitly, to take a project that I thought was worthwhile, such as automating the turn-up process of a complex service, and have leadership value it more highly, because the buzzword was being thrown around in meetings.
The truth is that most likely, you’ll have problems that cannot be fixed without touches from human hands — projects labelled as Zero Touch Production should be there to make this less common. They are about building tools that provide you with equivalent, low-privilege access.
If you have heard of “immutable containers”, that is a similar, yet not identical idea: if you don’t allow configuration to be persisted in the form of editable configuration files, but only passed in via environment variables or by rebuilding the code, you have a lot less risk that your system depends on a one-off fix to the config file from the oncall who was awake when the deployment nearly took down your entire company. But again, the “immutability” is often aspirational. Most of the systems that are declared “immutable” come with a way to self-configure based on an external “state storage” (I think outside the bubble this is usually Zookeeper, but in general it’s some kind of consensus broker). It’s like calling services “stateless” simply because the database is accessed over the network, rather than as a local file.
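For the environment-variable half of that idea, a small sketch: configuration is read once at startup into an object that cannot be edited afterwards. The variable names SERVICE_REGION and SERVICE_MEMORY_MB and the default value are invented for the example.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    region: str
    memory_mb: int


def load_settings(env=None) -> Settings:
    """Build the settings from environment variables only, never from a file."""
    env = os.environ if env is None else env
    return Settings(
        # A missing region is a hard error: fail at startup, not at 3am.
        region=env["SERVICE_REGION"],
        memory_mb=int(env.get("SERVICE_MEMORY_MB", "512")),
    )
```

There is no config file on disk for an oncall to patch by hand: changing a value means changing what the deployment passes in, which goes back through the source of truth.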
What I do think is that Zero Touch Production is overall a good goal to have, but not one that can easily be achieved overnight or in the span of a quarter, half, or year. If you do think it’s a good time to have this discussion within your organization, consider it as a banner to unify a number of different, and sometimes disparate, initiatives. In my experience it does a great job of getting people working on widely different areas (such as monitoring, deployment automation, release automation, testing, permission handling, etc.) to feel like they are working towards a common goal, and one that, if squinting the right way, is exactly what they wanted to get done for a long time!