Diagonal Contributions

This is a tale that starts at my previous day job. My role as an SRE had been (for the most part) one of support, with teams dedicated to developing the product, and my team making sure that it would perform reliably and without waste. The relationship with “the product team” has varied over time, depending on both the product and the SRE team’s disposition, and sometimes in ways that weren’t particularly healthy.

In one particular team, I found myself supporting (together with my team) six separate product teams, spread between Shanghai, Zurich and Mountain View. This put significant pressure on the team’s dynamics, particularly since half of the members (based in Pittsburgh) never even had a chance to meet the product teams of two of the services (based in Shanghai), as they would normally be 12 hours apart. It’s in this team that I started formulating the idea I keep referring to as “diagonal contributions”.

You see, there’s often a distinction drawn between horizontal and vertical contributions. Vertical contributions improve everything about one service, from the code itself to its health checks, release, deployment, rollout, and so on. Horizontal contributions improve one thing about every service, such as making every RPC-based server expose the same set of monitoring metrics. There are different schools of thought on which option is valid and which one should be incentivised, so which of the two approaches you’ll be rewarded for usually depends on your manager and their manager.

When you’re supporting so many different teams directly, vertical contributions are harder on the team overall: when you go all in to identify and fix all the issues for one of the products, you end up ignoring the work needed for the others. In these cases a horizontal approach might pay off faster, from an SRE point of view, but it comes with a cost: the product teams then have little visibility into your work, which can turn into a nasty confrontation, depending on the management you find yourself dealing with (on both sides).

It’s in that situation that I came up with “diagonal contributions”: improve a shared pain point across the services you own, and cover as many of those services as you can. In a similar fashion to rake collection, this is not an easy balance to strike, and it takes experience to get it right. You can imagine from the previous post that my success at working on this diagonal has varied considerably depending on teams, time, and management.

What did work for me was finding common pain points among the six products I supported, and addressing those not with changes to the products themselves, but with changes to the core libraries they used or the common services they relied upon. This allowed me to show actual progress to the product teams, while solving issues that were common to most of the teams in my area, or even in the company.

It’s a similar thing with rake collection for me: say there’s a process you need to follow that takes two to three days to go through, and four out of your six teams are supposed to go through it. It’s worth investing four to six days to reduce that process to something that takes even just a couple of hours: you need fewer net people-days even just looking at the raw numbers, which is very easy to tell, but that’s not where it stops! A process that takes more than a day adds significant risks: something can happen overnight, the person going through the process might have to take a day off, or they might have a lot of meetings the following day, adding an extra day to the total, and so on.
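
To spell out the raw numbers, here’s a back-of-the-envelope sketch in Python using the figures from the example above (purely illustrative, not real data):

```python
# Back-of-the-envelope comparison of person-days spent on a manual
# process versus investing in tooling that shortens it.
# Figures match the example in the text; they are illustrative only.

teams_running_process = 4       # four out of six teams need the process
manual_days_per_run = 2.5       # two to three days each; take the midpoint
tooling_investment_days = 5     # four to six days to build the shortcut
automated_days_per_run = 0.25   # "a couple of hours", roughly a quarter day

manual_total = teams_running_process * manual_days_per_run
tooled_total = tooling_investment_days + teams_running_process * automated_days_per_run

print(f"manual: {manual_total} person-days")        # 10.0
print(f"with tooling: {tooled_total} person-days")  # 6.0
```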

This is also another reason why I enjoy this kind of work: as I said before, I disagree with Randall Munroe when it comes to automation. It’s not just a matter of saving time on something trivial that you do rarely. Automation is much less likely to make one-off mistakes (it’s terrifyingly good at making repeated mistakes, of course), and even if it doesn’t take less time than a human would, it doesn’t take human time to do its work. A three-day-long process completed by automation is still a better use of time than a two-day-long process that relies on a person having two consecutive days to work on it.

So building automation or tooling, or spending time making core libraries easier to use, is in my book a good way to make contributions that are valuable beyond your immediate team, while not letting your supported teams feel like they are being ignored. But this only works if you know which pain points your supported teams have, and you can make a case that your work directly relates to those pain points. I’ve seen situations where a team has been working on very valuable automation… that relieved no pain from the supported team, giving them a feeling of not being taken into consideration.

In addition to a good relationship with the supported team, there’s another thing that helps. Actually, I would argue it does more than just help and is an absolute requirement: credibility. And, alongside it, management support. The former, in my experience, is a tricky one for many engineers, including me, to understand (or accept), because credibility in this space is often tied to the actions of your predecessors. Even when you’re supporting a new product team, it’s likely its members have had interactions with support teams (such as SRE) in the past, and those interactions will colour their initial impression of you and your team. This is even stronger when the product team has been assigned a new support team, or you’re a new member of the team, or you’re part of the “new generation” of a team that went through a bit of churn.

The way I have attacked that problem is by building up my credibility: by listening, and by asking the team what problems they feel are causing them trouble. Principles of reliability and best practices are not going to help a team that is struggling to find the time to work even on basic monitoring because they are under pressure to deliver something on time. Sometimes you can take some of their load away, in a way that is sustainable for your own team, that gains credibility, and that furthers the relationship. For instance, you may be able to spend some time writing the metric-exposing code, with the understanding that the product team will expand it as they introduce new features.
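
As a minimal sketch of what such a starting point could look like, here is some instrumentation using the open-source Prometheus Python client; the metric names and the handle_request() stub are hypothetical stand-ins for whatever the product actually serves:

```python
# Minimal sketch of metric-exposing code an SRE could hand over to a
# product team, using the open-source prometheus_client library.
# Metric names and the handle_request() stub are hypothetical examples.
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "app_requests_total", "Total requests served", ["endpoint", "status"]
)
LATENCY = Histogram(
    "app_request_latency_seconds", "Request latency in seconds", ["endpoint"]
)


def handle_request(endpoint: str) -> str:
    """Placeholder for the product's real request handler."""
    start = time.monotonic()
    try:
        result = "ok"  # the real work would happen here
        REQUESTS.labels(endpoint=endpoint, status="ok").inc()
        return result
    except Exception:
        REQUESTS.labels(endpoint=endpoint, status="error").inc()
        raise
    finally:
        LATENCY.labels(endpoint=endpoint).observe(time.monotonic() - start)


if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for scraping
    while True:
        handle_request("/ping")
        time.sleep(1)
```

The point of handing over something like this is that the skeleton (metric registration, the /metrics endpoint, the labelling scheme) is already in place, and the product team only needs to add labels and counters as they add features.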

The other factor, as I said, is management, and this is another of those things that might bring a feeling of unfairness. I have encountered managers who seem more concerned about immediate results than the long-term picture, and managers who appear afraid of suggesting projects that are not strictly within the scope of reliability, even when they would increase the team’s overall credibility. For this I unfortunately don’t have a good answer. On average, I have been lucky with the managers I have reported to.

So for all of you out there in a position of supporting a product team, I hope this post has given you some ideas on how to build a more effective, healthier relationship.

The importance of reliability

For the past seven years I worked at Google as a Site Reliability Engineer. I’m actually leaving the company; I’m writing this during my notice period. I’m currently scheduled to join Facebook as a Production Engineer at the end of May (unless COVID-19 makes things even more complicated). Both roles are related to the reliability of services (Google even put it in the name), so you could say I have more than a passing idea of what is involved in maintaining, if not writing, reliable software and services.

I had to learn to be an SRE. This was my first 9-5 job (well, not really 9-5, given that SREs tend to have very flexible and very strange hours), and I hadn’t worked at such a scale before. As I wrote before, the job got me used to expecting certain things. But it also made me realise how important it is for services and systems to be reliable, as well as secure, and just how unevenly that reliability is distributed out there.

During my tenure at Google, I’ve been on call for many different services. Pretty much all of them have been business critical in one way or another, some much more than others. But none of them were critical to society: I never joined the Google Cloud teams, any of the communication teams, or the Maps teams. I was in the Search team, and while Search is definitely important to the business, I think society would rather go without search results than without a way to contact their loved ones. But that’s just my personal opinion.

The current huge jump in people working from home due to COVID-19 concerns has clearly shown how critical to society some online services have become, services that even ten years ago wouldn’t have been considered that important: Hangouts, Meet, Zoom, Messenger, WhatsApp, and the list goes on. Video calls appear to be the only way to get in touch with our loved ones right now, as well as, for many, the only way to work. Thankfully, most of these services are provided by companies that are big enough to afford reliability in one form or another.

But at least in the UK, this has also shown how many other services are clearly critical for society, yet not provided by companies who can afford reliability. Online grocery shopping became the thing to do, nearly overnight. Ocado, possibly the biggest grocery delivery company, came under so much pressure that they had to scramble, first introducing a “virtual queue” system, and then eventually taking down the whole website. As I type this, their front page informs you that login is only available to those who already have a slot booked for this weekend; no new slots can be booked.

In a similar fashion, online retailers, GP surgeries’ online systems, online prescription services, and banks also appeared to be smothered in requests. I would be surprised if libraries, bookstores, and restaurant websites that don’t rely on the big delivery companies weren’t also affected.

And that made me sad, and at least in part made me feel ashamed of myself. You see, while I was looking for a new job, I was also interviewing at another place. Not a big multinational company, but a smaller one, a utility. And while the offer was very appealing, it was also a more challenging role, and I decided to pass on it. I’m not saying I would have made a bigger difference for them than any other “trained” SRE, but I do think that a lot of these “smaller” players need their fair dose of reliability.

The problem is that there’s a mixture of attitudes, and actual costs, involved in doing reliability the way Google and the other “bigs” do it. In the case of Google, more often than not the answer to something not working very well is to throw more resources (CPU, memory, storage) at it. That’s not something you can do quickly when your service is running “on premise” (that is, in your own datacenter cabinet), and not something you can do cheaply when you run on someone else’s cloud solution.

The thing is, Cloud is not just someone else’s computer. It’s a lot of computers, and it does add a lot of flexibility. It can even be cheaper than running your own server, sometimes. But it’s also a risk, because if you don’t know when to say “enough”, you end up with budget-wrecking bills, or sometimes with a problem “downstream”. Take Ocado: the likelihood is that it wasn’t the website being overloaded, it was the fulfilment. In that light, the virtual queue approach was awesome: it limited the whole human interaction, not just the browser requests. And indeed, the queue worked fine (unlike, say, the CCC ticket queue), and the website didn’t look overloaded at all.
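
To illustrate the general idea (this is only my guess at the shape of such a system, not how Ocado actually built theirs): a virtual queue hands each visitor a ticket and only admits a bounded number of them at a time, so everything behind the front door, fulfilment included, sees a capped load.

```python
# Toy sketch of a "virtual queue" / waiting room: admit at most
# `capacity` visitors at a time; everyone else gets a queue position.
# Purely illustrative; not how Ocado's system actually works.
import itertools


class VirtualQueue:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._tickets = itertools.count(1)  # monotonically increasing tickets
        self._admitted_up_to = capacity     # highest ticket currently allowed in

    def join(self) -> int:
        """Hand the visitor a ticket number when they arrive."""
        return next(self._tickets)

    def is_admitted(self, ticket: int) -> bool:
        """True if this visitor may enter the site and start shopping."""
        return ticket <= self._admitted_up_to

    def visitor_left(self) -> None:
        """When an admitted visitor finishes, let the next one in."""
        self._admitted_up_to += 1


queue = VirtualQueue(capacity=2)
tickets = [queue.join() for _ in range(4)]      # four visitors arrive
print([queue.is_admitted(t) for t in tickets])  # [True, True, False, False]
queue.visitor_left()                            # one shopper checks out
print(queue.is_admitted(tickets[2]))            # True: ticket 3 is now admitted
```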

But saying that on-premise equipment does not scale is not trying to market cloud solutions; it’s admitting the truth. If you start getting that many requests at short notice, you can’t go out, buy, image, and set up another four or five machines in time to serve them, but you can tell Google Cloud, Amazon, or Azure to triple the amount of resources available. And that might or might not make things better for you.

It’s a tradeoff, and not one I have answers for. I can’t say I have experience managing this tradeoff either: all the teams I worked on had nearly blank cheques for internal resources (not quite, but nearly), and while resource saving was and is a thing, it never becomes a real dollar amount that you, as an SRE, end up dealing with. Other companies, particularly smaller companies, need to pay a lot of attention to that.

From my point of view, what I can do is try to be more open in discussing design decisions in my software, particularly when I think it’s my experience talking. I still need to work actively on Tanuga, and I am even considering making a YouTube video of me discussing the way I plan to implement it, as if I were discussing it during a whiteboard design interview (since I have quite a bit of experience with those this year).