Incidents as Accidents

Jessie Singer’s book, There Are No Accidents, identifies accidents as having two primary components: human error and dangerous conditions. Fallibility is part of being human. We are imperfect, and we cannot change that however much we might want to. It is when an error meets a dangerous condition that serious consequences occur. Dangerous conditions, unlike fallibility, are something we can choose to address if we are so inclined.

Let’s see an example.

Accident - a pedestrian is hit by a car while crossing a divided highway

Possible errors

Driver is driving too fast for conditions
Driver is distracted
Pedestrian misjudged the available time to cross

In the book, the scenario is lifted from the real world. Where’s the dangerous condition? It turns out that the pedestrian is crossing the divided highway at that location for several reasons:

there is a bus stop on one side of the highway
there is a large apartment complex on the other side of the highway
the closest protected crossing is 1/4 mile away, a 1/2 mile round trip

Even on a day with good weather and people with ample time, some will choose to cross directly instead of walking an extra half mile. Add bad weather, people tired after a work shift, or people in a time crunch, and of course more will try the fast path directly across the road.

Now, what does all this have to do with software incidents? We ran a fairly standard incident review process: root cause analysis, 5 Whys, cost assessment, and durable changes to prevent future occurrences. The results always felt very micro-focused, and it wasn’t clear that we ever improved the systems that brought us to this point. So we made a small change. In addition to all the aforementioned things, we started treating incidents the way the book treats accidents, explicitly naming the dangerous conditions.

In addition to work items for fixing bugs, adding better observability, and the like, our incidents had us talking about how lack of dev/prod parity, unreliable developer environment setup, poor communication, and long feedback cycles were dangerous conditions in our environment. Next, people wanted to know what we were actually going to do with this information. They wanted to see action items come out of the exercise. In my opinion, practicing the identification of these conditions is itself an action. But that’s a story for another time.