Last month’s AWS outage caused the usual amount of stir, the classic amount of complaints about the cloud being just someone else’s computer, a number of discussions about multi-region deployments, cost-benefit analyses, and so on. Among all the usual noise, an infosec professional who shall remain unnamed amplified a barely explained report from someone complaining that their thermostat failed during the outage, causing the temperature in their residence to drop and killing the fish in their aquarium. The amplification (in the form of a retweet with the tagline «Death by AWS outage») got under my skin even before it became even more tasteless, given the death of six people in the collapse of an Amazon warehouse.
First of all, as I said on Twitter at the time, I don’t believe this ever happened. I find it unprofessional for someone working in infosec to use such strong words to amplify something that they can’t possibly verify. As many of the people in the replies to both tweets pointed out, no major brand of thermostat would behave like this: once configured, they all work fine without a network connection for an extended period of time, because the reasonable design for a device such as a thermostat is to not change anything in its programming unless it is explicitly told to.
I can’t exclude that further bugs in the logic make the “totally offline” situation work better than “online, but AWS is gone”: I have seen code for applications that behave exactly like that before. And I wrote a couple of weeks back about how a provider’s DNS can stop you from configuring your Nest. So there might be a thermostat that does behave exactly as described, but when I checked back a few days later, nobody had suggested a make and model that would let us figure out which manufacturer has been so careless as to not maintain local programming.
But there’s another exercise I can look into myself, which is to figure out which failure modes my thermostat has, and what I can do to mitigate more of them than I do right now. Obviously, my thermostat does not depend on the cloud in any particular way, since it’s an ESPHome-based system used together with Home Assistant on a local network. So does that mean that it does not suffer from any crippling design mistakes or limitations right now? I wish!
A thermostat, in its simplest of forms, needs three things: a temperature sensor, a set temperature, and an actuator. These can be simple and electromechanical, or they can be complex and “smart”. The more complex, the higher the risk of a fault. And when you have risk in a complex solution, you can only do one thing: add mitigations (and thus complexity) until the compromise satisfies your constraints.
In the case of my custom thermostat, actuation and set temperature (technically, two set points) are stored locally by ESPHome. They are also restored after a power cut, so those components shouldn’t be much of an issue. The sensor, on the other hand, is a different story.
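To make the “three things” concrete, here is a minimal sketch in Python of a thermostat decision with two set points. It is an illustration only, not the ESPHome configuration running on the actual devices: the names `Mode`, `decide`, `heat_below_c` and `cool_above_c` are made up for the example, and it assumes the two set points are a lower one for heating and a higher one for the aircon.

```python
from enum import Enum


class Mode(Enum):
    OFF = "off"
    HEAT = "heat"
    COOL = "cool"


def decide(temperature_c: float, heat_below_c: float, cool_above_c: float) -> Mode:
    """Pick an HVAC mode from one sensor reading and two set points.

    Hypothetical helper: heat when the reading is below the lower set point,
    cool when it is above the upper one, and stay off in between.
    """
    if heat_below_c >= cool_above_c:
        raise ValueError("heating set point must be below the cooling set point")
    if temperature_c < heat_below_c:
        return Mode.HEAT
    if temperature_c > cool_above_c:
        return Mode.COOL
    return Mode.OFF
```

The set points and the decision need nothing outside the device. The only input that may have to come from somewhere else is `temperature_c`.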
The main reason why I decided to reverse engineer the thermostats was that the original position of the sensors was not really useful to me. The effective temperature sensing happened on the thermostat itself, which seemed to always read a couple of degrees higher than I would have said. On top of that, the thermostats are placed by the door, making the reading a lot less representative of the temperature experienced in the room, particularly when it comes to the office I spend pretty much the whole day in, or the living room, where the sensor sits just by the kitchen area.
So what I’m doing now is exposing an average temperature between two sensors in each of those rooms. This allows accounting for the horrible draught in my office, as well as the difference in temperature between kitchen and sofa. But it does that at a cost: the average is calculated by Home Assistant, so if the Home Assistant host is not online, the thermostats don’t get a new temperature reading. This is generally an acceptable trade-off, given that the host is fairly reliable, and that we only rarely need to have the HVAC running, as the flat is fairly mild by itself, but it does not make the design free of faults.
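As a rough illustration of the failure mode, this is how a two-sensor average that knows about stale readings could look in Python. It is a sketch under assumptions, not how Home Assistant computes the value: `Reading`, `room_average` and the ten-minute `max_age_s` default are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Reading:
    celsius: float
    timestamp_s: float  # when the reading was taken, in seconds


def room_average(
    readings: List[Reading], now_s: float, max_age_s: float = 600.0
) -> Optional[float]:
    """Average the readings that are recent enough.

    Returns None when every reading is older than max_age_s (an assumed
    threshold), so the caller can tell "no fresh data" apart from a
    real temperature.
    """
    fresh = [r.celsius for r in readings if now_s - r.timestamp_s <= max_age_s]
    if not fresh:
        return None
    return sum(fresh) / len(fresh)
```

Returning `None` rather than the last known average makes the “host is offline” case explicit, instead of letting the thermostat act on an old number forever.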
I have thought about this, but not really written much on the topic, so I thought it would make a useful blog post now. Among other things, because the way I thought about this is by looking at how other solutions, particularly big-name commercial projects, solved a similar constraint. If you look at the post linked above about the Nest Thermostat E by Google, you may be able to tell that their design does indeed defend against some of the problems we’ve been discussing.
The Thermostat E is made up of two components: the desk “thermostat” that includes the temperature sensor and allows configuring the set point, and the actuator that goes on the wall, which is what actually switches the heating system on or off. These two components talk to each other over Bluetooth LE, which means that they don’t depend on either the Cloud or the local network being up and running to work. This is indeed a good design, because it tolerates not just the Internet connection going down, but also a protracted lack of WiFi due to, for instance, the router breaking on you. Indeed, you can disconnect a Nest thermostat from the Internet right after setting it up, and it will work just fine, not unlike the ESPHome one I have built, with the added ability to move your temperature sensor to alternative locations.
It shouldn’t be that difficult then to do the same in ESPHome, so that the sensor is read directly over BLE, instead of depending on Home Assistant. The averaging can be done directly in the configuration, so that the only thing that involves Home Assistant is pushing the new set points. Well, at least on paper. The first problem is that not all of the temperatures I’m using for the averaging come from the CGG1 sensors, which would indeed be readable from ESPHome. Both in my office and in the bedroom, I’m relying on the Dyson purifiers’ temperature sensors as the secondary reading (they are positioned far from the CGG1s, making them very useful to gauge the trend in the room), and those are not easily accessible without going through the Home Assistant installation.
The second problem is more malleable: it looks like the configuration I’m currently using to read the BLE sensors is not compatible with ESPHome’s Over-The-Air updates. Whenever I try using OTA to update my bridge device, it fails midway through the upload. At first I thought that the problem was with the number of BLE sensors I was trying to keep up with, but that does not matter: I added a single sensor to the thermostat in the office, and while it reported correctly, it failed OTA just as badly as the bridge with all of the sensors in it. I should probably engage with ESPHome upstream to figure out why this is happening, and see if there is any solution that avoids it in the first place. It would then be much easier to read at least some of the sensors directly with the thermostats and no longer even need the bridge device, as each of the three rooms could take care of reporting its closest sensors.
Do I have any remaining options to mitigate the risks? Well, yes. As I reported in my initial investigation on the bus, the actual HVAC engine reports its own sensed temperature as part of the reply to the panel. There’s no reason why we shouldn’t be able to make use of this temperature in case the sensor provided by Home Assistant fails to update. It wouldn’t be the most useful temperature to have, but it would prevent some horrible failures in case the temperature suddenly moves away from the set points in either direction. Why did I not implement this yet? Well, mostly because it’s a very narrow case at this point. It would require the network to have gone down entirely, and the right decision to be “turn the heating on” or “turn the aircon on” rather than “turn the HVAC off,” and that sounds very unlikely.
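The fallback described here is not implemented, but its logic is small enough to sketch in Python. Everything in the snippet is an assumption made for illustration: the function name `effective_temperature`, the source labels, and the fifteen-minute `max_age_s` after which the Home Assistant value counts as stale.

```python
from typing import Optional, Tuple


def effective_temperature(
    remote_c: Optional[float],
    remote_age_s: Optional[float],
    hvac_c: Optional[float],
    max_age_s: float = 900.0,
) -> Tuple[Optional[float], str]:
    """Choose which temperature the thermostat should act on.

    Prefer the value pushed by Home Assistant while it is fresh, fall back
    to the temperature the HVAC unit reports on the bus, and report that
    nothing is available if both are missing.
    """
    remote_is_fresh = (
        remote_c is not None
        and remote_age_s is not None
        and remote_age_s <= max_age_s
    )
    if remote_is_fresh:
        return remote_c, "remote"
    if hvac_c is not None:
        return hvac_c, "hvac"
    return None, "none"
```

Returning the source alongside the value would let the device report that it is running on the less representative sensor, instead of switching over silently.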
This is not to say that there are no real situations where that would be the case. After all, I configured the thermostats with a high set point for the aircon for when we’re not even in London, rather than telling them to stay constantly off. During a heat wave the flat can actually get very, very warm, which means there are things, such as my insulin, that would be at risk. Most of the insulin is in the fridge anyway, but it helps to make sure that we don’t suddenly end up with over 30°C in the flat.
But there’s a limit to how much complexity I’m accepting for a situation that can still happen without me being able to do anything about it. If the power were to go out, it wouldn’t really matter what the thermostats were thinking: the HVAC would stay turned off, and the flat would be left at the mercy of the weather. Thankfully, it’s insulated well enough that we don’t have significant jumps to cold or hot, except for heat waves and other similarly exceptional cases.
For a thermostat failure to cause the death of even fish during a two-hour outage, it sounds like the thermostat was controlling heating in an open space or a badly insulated house that could suddenly get close to freezing temperature without the action of the climate control. If that were the case here, I definitely would have looked at putting a stronger mitigation in place for this class of problems, and possibly included my original design of a cut-off to the original control panel and its temperature sensor as a fallback.
There are more issues with my thermostats that could be taken care of with a bit more complexity and some iteration. I have considered adding some basic status reporting and interaction to the thermostats, allowing for instance to turn them off without having to go through Home Assistant, or even configuring the set points with a small display and two buttons. But that does not appear to be needed at the moment, so I have not invested the time (and money: significant money would be involved in redesigning it that way) to address a problem that is real, but very low on the list of my priorities.
What this post is meant to tell you is that instead of taking at face value any unverified statement that strengthens your position, it’s worth thinking about problems and solutions, costs and benefits, and complexity and fault tolerance as nuanced problems. And instead of slagging off smart technology users who might have reasons beyond your understanding of the problem, or who might be trying to help, you may want to provide them with alternative solutions, like I keep saying.
So to close this off, a huge thank you again to ESPHome and Home Assistant, not just for having built two incredible platforms for user-centric Smart Home solutions, but also for pushing in the right direction for manufacturers to follow. And for admitting that sometimes, even though Cloud is not a perfect solution, it’s okay to do something in a less than optimal way because it is fun.