It’s been a spacey week for me. Since last Thursday I’ve read “The Martian”, then watched the movie with Matt Damon playing the guy they left stranded on Mars. Over the weekend there was also a fascinating documentary on the Apollo missions featuring interviews with Apollo astronauts.
Writing reliable software (read: software that doesn’t crash) is a big deal, especially if you’re in space. At least 90% of the effort of designing and writing reliable software is working out the ways it’s possible to screw it up and dealing with them.
But even more important is recovering from all the events you didn’t anticipate. The Apollo 11 mission that landed the first man on the moon had an unexpected computer error on descent. I’m guessing here, but I imagine it’s quite important your computer doesn’t lock up when you’re trying to land a spaceship. The software coped with it admirably, despite the error being completely unanticipated.*
Margaret Hamilton led the team who wrote the software for the Apollo missions, and handling error conditions became a fundamental part of how she structured her approach. She’s a fascinating person to read up on while you’re researching the computer tech used on the Apollo missions (which I’m sure you’ll go off and do now).
Proximity to Developer
I discovered a good test for reliable code is how well it performs as a function of how far away the developer is. If the dev is standing looking at it, chances are it’ll be fine. If they move to the next room, there’s more chance of a failure. If they deploy it on a different continent (or planetary body) then… all bets are off.
I’ve written mission-critical code for various companies, including the NZ TAB, where downtime causes millions of dollars of lost turnover, and Weta Digital, where downtime causes James Cameron to shout at you. Oddly enough, I also gave away a piece of software for free that ended up being used in an emergency medical department in the USA, which then asked me to guarantee it wouldn’t kill anybody on Y2K.
When Beany launched in 2013 I was pleasantly surprised it stayed up and functioned well even while I slept! People still log in at all times of day and night, update their accounts and purchase things. All by themselves! It’s been a very reliable service with no appreciable downtime at all so far (touch wood!).
We spend a lot of time making sure Beany is secure and stable, but we stand on the shoulders of giants who have made computer systems and software ever-more reliable by learning under pressure.
* Margaret Hamilton says the “1202 error” condition was caused by incorrect documentation saying how a switch should be set (source: Wikipedia). Buzz Aldrin says he explicitly wanted both the landing system and the rendezvous systems running simultaneously, which wasn’t what the MIT developers had expected (source: interview in the “In the Shadow of the Moon” documentary). The end-user is always right!